Fastest way to find out if two sets overlap? - c++

Obviously doing std::set_intersection() is a waste of time.
Isn't there a function in the algorithm header for doing exactly this?
std::find_first_of() is doing a linear search as far as I understand.

This is a solution only for std::set (or multi). A solution for map would require only a bit more work.
I try it three ways.
First, if one set is far larger than the other, I simply look up every element of the smaller set in the larger one. The second case is the same with the roles swapped.
The constant 100 is wrong theoretically. It should be k n lg m > m for some k, not 100 n > m for optimal big-O performance: but the constant factor is large, and 100>lg m, so really one should experiment.
If that isn't the case, we walk through each collection looking for collisions much like set_intersection. Instead of just ++, we use .lower_bound to try to skip through each list faster.
Note that if your list consists of interleaved elements (like {1,3,7} and {0,2,4,6,8}) this will be slower than just ++ by a logarithmic factor.
If the two sets "cross" each other less often, this can skip over large amounts of each set's contents.
Replace the lower_bound portion with a mere ++ if you want to compare the two behaviors.
template<class Lhs, class Rhs>
bool sorted_has_overlap( Lhs const& lhs, Rhs const& rhs ) {
if (lhs.empty() || rhs.empty()) return false;
if (lhs.size() * 100 < rhs.size()) {
for (auto&& x:lhs)
if (rhs.find(x)!=rhs.end())
return true;
return false;
}
if (rhs.size() * 100 < lhs.size()) {
for(auto&& x:rhs)
if (lhs.find(x)!=lhs.end())
return true;
return false;
}
using std::begin; using std::end;
auto lit = begin(lhs);
auto lend = end(lhs);
auto rit = begin(rhs);
auto rend = end(rhs);
while( lit != lend && rit != rend ) {
if (*lit < *rit) {
lit = lhs.lower_bound(*rit);
continue;
}
if (*rit < *lit) {
rit = rhs.lower_bound(*lit);
continue;
}
return true;
}
return false;
}
A sorted array could use the third algorithm, with std::lower_bound doing the fast advance of the "other" container. This has the advantage of allowing partial searches that start from the current position (which you cannot do fast in a set). It will also behave poorly on "interleaved" elements (by a log n factor) compared to naive ++.
The first two can also be done fast with sorted arrays, replacing method calls with calls to algorithms in std. Such a transformation is basically mechanical.
An asymptotically optimal version on a sorted array would use a binary search biased towards finding lower bounds at the start of the list -- search at 1, 2, 4, 8, etc instead of at half, quarters, etc. Note that this has the same lg(n) worst case, but is O(1) if the searched for element is first instead of O(lg(n)). As that case (where the search advances less) means less global progress is made, optimizing the sub-algorithm for that case gives you a better global worst-case speed.
To see why: on "fast alternation" it wouldn't perform any worse than ++, since the case where the next element is a sign swap takes O(1) operations, and it replaces O(k) with O(lg k) if the gap is larger.
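A minimal sketch of the biased ("exponential" or "galloping") search described above, assuming sorted ranges with random-access iterators. The names gallop_lower_bound and sorted_ranges_overlap are made up for this illustration and are not part of the standard library.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical helper: same result as std::lower_bound, but it probes at
// exponentially growing distances from `first` before binary-searching the
// bracket it found, so a match near the front costs O(1).
template <class RandomIt, class T>
RandomIt gallop_lower_bound(RandomIt first, RandomIt last, const T& value) {
    using diff_t = typename std::iterator_traits<RandomIt>::difference_type;
    const diff_t n = last - first;
    diff_t lo = 0;    // every element before first + lo is known to be < value
    diff_t step = 1;
    while (lo + step <= n && *(first + (lo + step - 1)) < value) {
        lo += step;
        step *= 2;
    }
    const diff_t hi = std::min(n, lo + step);
    return std::lower_bound(first + lo, first + hi, value);
}

// The third algorithm from the answer, written for sorted arrays.
template <class RandomIt1, class RandomIt2>
bool sorted_ranges_overlap(RandomIt1 lit, RandomIt1 lend,
                           RandomIt2 rit, RandomIt2 rend) {
    while (lit != lend && rit != rend) {
        if (*lit < *rit)
            lit = gallop_lower_bound(lit, lend, *rit);
        else if (*rit < *lit)
            rit = gallop_lower_bound(rit, rend, *lit);
        else
            return true;   // neither is less than the other: common element
    }
    return false;
}
```

Whether this beats plain ++ or plain std::lower_bound depends on the data, so it is worth measuring before adopting it.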
However, by this point we are far, far down an optimization hole: profile, and determine if it is worth it before proceeding this way.
Another approach on sorted arrays is to presume that std::set_intersection is written optimally (on random access iterators). Use an output iterator that throws an exception if written to. Return true iff you catch that exception, false otherwise.
(the optimizations above -- pick one and bin search the other, and exponential advance searching -- may be legal for a std::set_intersection.)
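A rough sketch of the throwing-output-iterator idea, assuming std::set_intersection is the algorithm being driven. The types overlap_found and throwing_output_iterator and the function overlap_by_throw are hypothetical names introduced only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical exception type, thrown on the first write.
struct overlap_found {};

// Hypothetical output iterator: assigning any element through it throws.
struct throwing_output_iterator {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    throwing_output_iterator& operator*() { return *this; }
    throwing_output_iterator& operator++() { return *this; }
    throwing_output_iterator operator++(int) { return *this; }

    // Called for `*out = element`; copying the iterator itself still uses
    // the implicit copy assignment and does not throw.
    template <class T>
    throwing_output_iterator& operator=(const T&) { throw overlap_found{}; }
};

// Both ranges must be sorted, as std::set_intersection requires.
template <class It1, class It2>
bool overlap_by_throw(It1 first1, It1 last1, It2 first2, It2 last2) {
    try {
        std::set_intersection(first1, last1, first2, last2,
                              throwing_output_iterator{});
    } catch (const overlap_found&) {
        return true;    // the algorithm tried to emit a common element
    }
    return false;
}
```

Using an exception for control flow like this is a trick rather than a recommendation; the cost of the throw may outweigh the benefit when overlaps are common.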
I think the use of 3 algorithms is important. Set intersection testing where one side is much smaller than the other is probably common: the extreme case of one element on one side, and many on the other is very well known (as a search).
A naive 'double linear' search gives you up to linear performance in that common case. By detecting the asymmetry between sides, you can switch over to 'linear in small, log in large' at an opportune point, and have the much better performance in those cases. O(n+m) vs O(m lg n) -- if m < O(n/lg n) the second beats the first. If m is a constant, then we get O(n) vs O(lg n) -- which includes the edge case of 'use function to find if a single element is in some large collection'.

You can use the following template function if the inputs are sorted:
template<class InputIt1, class InputIt2>
bool intersect(InputIt1 first1, InputIt1 last1, InputIt2 first2, InputIt2 last2)
{
while (first1 != last1 && first2 != last2) {
if (*first1 < *first2) {
++first1;
continue;
}
if (*first2 < *first1) {
++first2;
continue;
}
return true;
}
return false;
}
You can use like this:
#include <iostream>
int main() {
int a[] = {1, 2, 3};
int b[] = {3, 4};
int c[] = {4};
std::cout << intersect(a, a + 3, b, b + 2) << std::endl;
std::cout << intersect(b, b + 2, c, c + 1) << std::endl;
std::cout << intersect(a, a + 3, c, c + 1) << std::endl;
}
Result:
1
1
0
This function has complexity O(n + m) where n, m are input sizes. But if one input is much smaller than the other (n << m for example), it's better to check, for each of the n elements, whether it belongs to the other input using binary search. This gives O(n * log(m)) time.
#include <algorithm>
/**
* When input1 is much smaller than input2.
* std::binary_search needs forward iterators for the second range.
*/
template<class InputIt1, class ForwardIt2>
bool intersect(InputIt1 first1, InputIt1 last1, ForwardIt2 first2, ForwardIt2 last2) {
while (first1 != last1)
if (std::binary_search(first2, last2, *first1++))
return true;
return false;
}

Sometimes you can encode sets of numbers in a single memory word. For example, you could encode the set {0,2,3,6,7} in the memory word: ...00000011001101. The rule is: the bit at position i (reading from right to left) is up, if and only if the number i is in the set.
Now if you have two sets, encoded in the memory words a and b, you can perform the intersection using the bitwise operator &.
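unsigned a = ...;
unsigned b = ...;
unsigned intersection = a & b;
bool overlap = (a & b) != 0; // the sets share at least one element
unsigned set_union = a | b; // bonus ("union" itself is a reserved word in C++)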
The good thing of this style, is that the intersection ( union, complementation ) is performed in one cpu instruction (I don't know if this is the correct term).
You could use more than one memory word, if you need to handle numbers that are greater than the number of bits of a memory word. Normally, I use an array of memory words.
If you want to handle negative numbers, just use two arrays, one for negative numbers, and one for positive numbers.
The bad thing of this method, is that it works only with integers.
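As an illustration of the "array of memory words" variant, here is a small sketch. It assumes non-negative integers below a known bound; word_set, make_word_set and words_overlap are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: each 64-bit word holds the membership bits of
// 64 consecutive integers.
using word_set = std::vector<std::uint64_t>;

// Builds the encoding; every value is assumed to be < universe_size.
word_set make_word_set(const std::vector<unsigned>& values,
                       std::size_t universe_size) {
    word_set words((universe_size + 63) / 64, 0);
    for (unsigned v : values)
        words[v / 64] |= std::uint64_t{1} << (v % 64);
    return words;
}

// True if some bit is set in both encodings, i.e. the sets overlap.
bool words_overlap(const word_set& a, const word_set& b) {
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] & b[i])
            return true;
    return false;
}
```

The overlap test stops at the first word with a common bit, so it touches at most universe_size / 64 words regardless of how many elements the sets hold.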

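I think you can use std::binary_search. Note that std::set iterators are only bidirectional, so std::binary_search still has to step through the set linearly between probes; s2.find(i) or s2.count(i) gives a true logarithmic lookup.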
#include <set>
#include <iostream>
#include <algorithm>
bool overlap(const std::set<int>& s1, const std::set<int>& s2)
{
for( const auto& i : s1) {
if(std::binary_search(s2.begin(), s2.end(), i))
return true;
}
return false;
}
int main()
{
std::set<int> s1 {1, 2, 3};
std::set<int> s2 {3, 4, 5, 6};
std::cout << overlap(s1, s2) << '\n';
}

Related

Find out in linear time whether there is a pair in sorted vector that adds up to certain value

Given an std::vector of distinct elements sorted in ascending order, I want to develop an algorithm that determines whether there are two elements in the collection whose sum is a certain value, sum.
I've tried two different approaches with their respective trade-offs:
I can scan the whole vector and, for each element in the vector, apply binary search (std::lower_bound) on the vector for searching an element corresponding to the difference between sum and the current element. This is an O(n log n) time solution that requires no additional space.
I can traverse the whole vector and populate an std::unordered_set. Then, I scan the vector and, for each element, I look up in the std::unordered_set for the difference between sum and the current element. Since searching on a hash table runs in constant time on average, this solution runs in linear time. However, this solution requires additional linear space because of the std::unordered_set data structure.
Nevertheless, I'm looking for a solution that runs in linear time and requires no additional linear space. Any ideas? It seems that I'm forced to trade speed for space.
As the std::vector is already sorted and you can calculate the sum of a pair on the fly, you can achieve a linear time solution in the size of the vector with O(1) space.
The following is an STL-like implementation that requires no additional space and runs in linear time:
template<typename BidirIt, typename T>
bool has_pair_sum(BidirIt first, BidirIt last, T sum) {
if (first == last)
return false; // empty range
for (--last; first != last;) {
if ((*first + *last) == sum)
return true; // pair found
if ((*first + *last) > sum)
--last; // decrease pair sum
else // (*first + *last) < sum (trichotomy)
++first; // increase pair sum
}
return false;
}
The idea is to traverse the vector from both ends – front and back – in opposite directions at the same time and calculate the sum of the pair of elements while doing so.
At the very beginning, the pair consists of the elements with the lowest and the highest values, respectively. If the resulting sum is lower than sum, then advance first – the iterator pointing at the left end. Otherwise, move last – the iterator pointing at the right end – backward. This way, the resulting sum progressively approaches sum. If both iterators end up pointing at the same element and no pair whose sum is equal to sum has been found, then there is no such pair.
auto main() -> int {
std::vector<int> vec{1, 3, 4, 7, 11, 13, 17};
std::cout << has_pair_sum(vec.begin(), vec.end(), 2) << ' ';
std::cout << has_pair_sum(vec.begin(), vec.end(), 7) << ' ';
std::cout << has_pair_sum(vec.begin(), vec.end(), 19) << ' ';
std::cout << has_pair_sum(vec.begin(), vec.end(), 30) << '\n';
}
The output is:
0 1 0 1
Thanks to the generic nature of the function template has_pair_sum() and since it just requires bidirectional iterators, this solution works with std::list as well:
std::list<int> lst{1, 3, 4, 7, 11, 13, 17};
has_pair_sum(lst.begin(), lst.end(), 2);
I had the same idea as the one in the answer of 眠りネロク, but with a little bit more comprehensible implementation.
bool has_pair_sum(std::vector<int> v, int sum){
if(v.empty())
return false;
std::vector<int>::iterator p1 = v.begin();
std::vector<int>::iterator p2 = v.end(); // past-the-end iterator, one position after the last element
p2--; // Now it points to the last element.
while(p1 != p2){
if(*p1 + *p2 == sum)
return true;
else if(*p1 + *p2 < sum){
p1++;
}else{
p2--;
}
}
return false;
}
Since the array is already sorted, we can use the two-pointer approach. Keep a left pointer at the start of the array and a right pointer at the end. In each iteration, check whether the sum of the two pointed-to values equals the given sum; if it does, return. Otherwise we have to decide how to shrink the range, by either advancing the left pointer or moving the right pointer back.
If the temporary sum is greater than the given sum, move the right pointer back: advancing the left pointer could only keep the temporary sum the same or increase it, whereas moving the right pointer back decreases it toward the given sum. Similarly, if the temporary sum is less than the given sum, moving the right pointer back would only keep it the same or decrease it, so advance the left pointer to increase it. Repeat until the sums are equal or the two pointers meet.
Below is the code for demonstration; let me know if something is not clear.
bool pairSumExists(const std::vector<int> &a, int sum){
    if(a.empty())
        return false;
    std::size_t left_pointer = 0, right_pointer = a.size() - 1;
    while(left_pointer < right_pointer){
        const int pair_sum = a[left_pointer] + a[right_pointer];
        if(pair_sum == sum){
            return true;
        }
        if(pair_sum > sum){
            --right_pointer; // sum too large: drop the largest remaining element
        }
        else{
            ++left_pointer; // sum too small: drop the smallest remaining element
        }
    }
    return false;
}

What is the most efficient way of copying elements that occur only once in a std vector?

I have a std vector with elements like this:
[0 , 1 , 2 , 0 , 2 , 1 , 0 , 0 , 188 , 220 , 0 , 1 , 2 ]
What is the most efficient way to find and copy the elements that occur only once in this vector, excluding the brute force O(n^2) algorithm? In this case the new list should contain [188, 220]
Make an unordered_map<DataType, Count> count;
Iterate over the input vector increasing count of each value. Sort of count[value]++;
Iterate over the count map copying keys for which value is 1.
It's O(n). You have hashes so for small data sets normal map might be more efficient, but technically it would be O(n log n).
It's a good method for discrete data sets.
Code example:
#include <iostream>
#include <unordered_map>
#include <vector>
#include <algorithm>
using namespace std;
int main() {
vector<int> v{1,1,2,3,3,4};
unordered_map<int,int> count;
for (const auto& e : v) count[e]++;
vector<int> once;
for (const auto& e : count) if(e.second == 1) once.push_back(e.first);
for (const auto& e : once) cout << e << '\n';
return 0;
}
I have tried a few ideas, but I don't see a way around map. unordered_multiset is almost a great way... except it does not allow you to iterate over keys. It has a method to check for count of key, but you would need another set just for keys to probe. I don't see it as a simpler way. In modern C++ with auto, counting is easy. I've also looked through the algorithm library, but I haven't found any transform, copy_if, generate, etc. which could conditionally transform an element (map entry -> value if count is 1).
There are very few universally-optimal algorithms. Which algorithm works best usually depends upon the properties of the data that's being processed. Removing duplicates is one such example.
Is v small and filled mostly with unique values?
auto lo = v.begin(), hi = v.end();
std::sort(lo, hi);
while (lo != v.end()) {
hi = std::mismatch(lo + 1, v.end(), lo).first;
lo = (std::distance(lo, hi) == 1) ? hi : v.erase(lo, hi);
}
Is v small and filled mostly with duplicates?
auto lo = v.begin(), hi = v.end();
std::sort(lo, hi);
while (lo != v.end()) {
hi = std::upper_bound(lo + 1, v.end(), *lo);
lo = (std::distance(lo, hi) == 1) ? hi : v.erase(lo, hi);
}
Is v gigantic?
std::unordered_map<int, bool> keyUniqueness{};
keyUniqueness.reserve(v.size());
for (int key : v) {
bool wasMissing = keyUniqueness.find(key) == keyUniqueness.end();
keyUniqueness[key] = wasMissing;
}
v.clear();
for (const auto& element : keyUniqueness) {
if (element.second) { v.push_back(element.first); }
}
And so on.
@luk32's answer is definitely the most time efficient way of solving this question. However, if you are short on memory and can't afford an unordered_map, there are other ways of doing it.
You can use std::sort() to sort the vector first. Then the non-duplicates can be found in one iteration. Overall complexity being O(nlogn).
If the question is slightly different, and you know there is only one non-duplicate element, you can use this code (code in Java). The complexity here is O(n).
Since you use a std::vector, I presume you want to maximize all its benefits including reference locality. In order to do that, we need a bit of typing here. And I benchmarked the code below...
I have a linear O(n) algorithm here (effectively O(n log(n)) once the sort is counted); it's a bit like brian's answer, but I use OutputIterators instead of doing it in place. The precondition is that the input is sorted.
// Copies every value that occurs exactly once in the sorted range [first, last).
// The iterators must be multi-pass (forward iterators or better).
template<typename InputIterator, typename OutputIterator>
OutputIterator single_unique_copy(InputIterator first, InputIterator last, OutputIterator result){
    while(first != last){
        auto run_start = first; // first element of the current run of equal values
        std::size_t run_length = 0;
        while(first != last && *first == *run_start){
            ++first;
            ++run_length;
        }
        if(run_length == 1)
            *(result++) = *run_start;
    }
    return result;
}
And here is a sample usage:
int main(){
std::vector<int> vm = {0, 1, 2, 0, 2, 1, 0, 0, 1, 88, 220, 0, 1, 2, 227, -8};
std::vector<int> kk;
std::sort(vm.begin(), vm.end());
single_unique_copy(vm.begin(), vm.end(), std::back_inserter(kk));
for(auto x : kk) std::cout << x << ' ';
return 0;
}
As expected, the output is:
-8 88 220 227
Your use case may be different from mine, so, profile first... :-)
EDIT:
Using luk32's algorithm and mine... Using 13 million elements...
created in descending order, repeated at every i % 5.
Under debug build, luk32: 9.34seconds and mine: 7.80seconds
Under -O3, luk32: 2.71seconds and mine 0.52seconds
Mingw5.1 64bit, Windows10, 1.73Ghz Core i5 4210U, 6GB DDR3 1600Mhz RAM
Benchmark here, http://coliru.stacked-crooked.com/a/187e5e3841439742
For smaller numbers, the difference still holds, until it becomes a non-critical code

Algorithm for finding the number which appears the most in a row - C++

I need help designing an algorithm for the following problem: there is a row of numbers, each of which may appear several times, and I need to find the number that appears most often and how many times it appears, e.g.:
1-1-5-1-3-7-2-1-8-9-1-2
That would be 1 and it appears 5 times.
The algorithm should be fast (that's my problem).
Any ideas ?
What you're looking for is called the mode. You can sort the array, then look for the longest repeating sequence.
You could keep hash table and store a count of every element in that structure, like this
h[1] = 5
h[5] = 1
...
You can't get it any faster than in linear time, as you need to at least look at each number once.
If you know that the numbers are in a certain range, you can use an additional array to sum up the occurrences of each number, otherwise you'd need a hashtable, which is slightly slower.
Both of these need additional space though and you need to loop through the counts again in the end to get the result.
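For the known-range case, a counting array might look roughly like this. The function name mode_by_counting and the max_value parameter are assumptions of this sketch; it also assumes a non-empty input whose values all lie in [0, max_value].

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns {value, count} for the most frequent value.
// Ties are resolved in favour of the smallest value.
std::pair<int, int> mode_by_counting(const std::vector<int>& numbers,
                                     int max_value) {
    std::vector<int> counts(static_cast<std::size_t>(max_value) + 1, 0);
    for (int x : numbers)
        ++counts[x];                    // one pass to count occurrences
    int best = 0;
    for (int v = 1; v <= max_value; ++v)    // second pass over the counts
        if (counts[v] > counts[best])
            best = v;
    return {best, counts[best]};
}
```

For the row from the question, mode_by_counting({1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2}, 9) yields the pair (1, 5).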
Unless you really have a huge amount of numbers and absolutely require O(n) runtime, you could simply sort your array of numbers. Then you can walk once through the numbers and simply keep the count of the current number and the number with the maximum of occurrences in two variables. So you save yourself a lot of space, trading it off with a little bit of time.
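The sort-then-scan approach could be sketched as follows. The name mode_by_sorting is made up here, and the sketch assumes a non-empty vector of int.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Takes the vector by value so the caller's data stays in its original order.
// Returns {value, count}; among ties, the smallest value wins.
std::pair<int, std::size_t> mode_by_sorting(std::vector<int> numbers) {
    std::sort(numbers.begin(), numbers.end());
    int best_value = numbers.front();
    std::size_t best_count = 0;
    std::size_t current_count = 0;
    for (std::size_t i = 0; i < numbers.size(); ++i) {
        // Extend the current run, or start a new one when the value changes.
        if (i > 0 && numbers[i] == numbers[i - 1])
            ++current_count;
        else
            current_count = 1;
        if (current_count > best_count) {
            best_count = current_count;
            best_value = numbers[i];
        }
    }
    return {best_value, best_count};
}
```

If modifying the input is acceptable, sorting it in place instead of copying avoids the extra O(n) space.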
There is an algorithm that solves your problem in linear time (linear in the number of items in the input). The idea is to use a hash table to associate to each value in the input a count indicating the number of times that value has been seen. You will have to profile against your expected input and see if this meets your needs.
Please note that this uses O(n) extra space. If this is not acceptable, you might want to consider sorting the input as others have proposed. That solution will be O(n log n) in time and O(1) in space.
Here's an implementation in C++ using std::unordered_map (std::tr1::unordered_map on pre-C++11 compilers):
#include <iostream>
#include <unordered_map>
typedef std::unordered_map<int, int> map;
int main() {
    map m;
    int a[12] = {1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2};
    // Count how many times each value appears.
    for(int i = 0; i < 12; i++) {
        int key = a[i];
        map::iterator it = m.find(key);
        if(it == m.end()) {
            m.insert(map::value_type(key, 1));
        }
        else {
            it->second++;
        }
    }
    // Find the value with the highest count.
    int count = 0;
    int value = 0;
    for(map::iterator it = m.begin(); it != m.end(); it++) {
        if(it->second > count) {
            count = it->second;
            value = it->first;
        }
    }
    std::cout << "Value: " << value << std::endl;
    std::cout << "Count: " << count << std::endl;
}
The algorithm works using the input integers as keys in a hashtable to a count of the number of times each integer appears. Thus the key (pun intended) to the algorithm is building this hash table:
int key = a[i];
map::iterator it = m.find(key);
if(it == m.end()) {
m.insert(map::value_type(key, 1));
}
else {
it->second++;
}
So here we are looking at the ith element in our input list. Then what we do is we look to see if we've already seen it. If we haven't, we add a new value to our hash table containing this new integer, and an initial count of one indicating this is our first time seeing it. Otherwise, we increment the counter associated to this value.
Once we have built this table, it's simply a matter of running through the values to find one that appears the most:
int count = 0;
int value;
for(map::iterator it = m.begin(); it != m.end(); it++) {
if(it->second > count) {
count = it->second;
value = it->first;
}
}
Currently there is no logic to handle the case of two distinct values appearing the same number of times and that number of times being the largest amongst all the values. You can handle that yourself depending on your needs.
Here is a simple one, that is O(n log n):
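Sort the vector # O(n log n)
Create vars: int MOST = 0, VAL, CURRENT = 0, PREVIOUS
for ELEMENT in LIST:
    if ELEMENT is not the first element and ELEMENT != PREVIOUS:
        CURRENT = 0 # a new run of equal values starts here
    CURRENT += 1
    if CURRENT >= MOST:
        MOST = CURRENT
        VAL = ELEMENT
    PREVIOUS = ELEMENT
return (VAL, MOST)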
There are a few methods:
The universal method is "sort it and find the longest run of equal elements", which is O(n log n). Quicksort is usually the fastest sort on average, though its worst case is O(n^2). You can also use heapsort, which is somewhat slower on average but is O(n log n) even in the worst case.
If you have some information about the numbers, you can use some tricks. If the numbers come from a limited range, you can use the counting step of counting sort, which is O(n).
If that isn't your case, there are other sorting algorithms that run in linear time, but none of them is universal.
The best time complexity you can get here is O(n). You have to look through all elements, because the last element may be the one which determines the mode.
The solution depends on whether time or space is more important.
If space is more important, then you can sort the list then find the longest sequence of consecutive elements.
If time is more important, you can iterate through the list, keeping a count of the number of occurrences of each element (e.g. hashing element -> count). While doing this, keep track of the element with max count, switching if necessary.
If you also happen to know that the mode is the majority element (i.e. there are more than n/2 elements in the array with this value), then you can get O(n) speed and O(1) space efficiency.
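One well-known way to get O(n) time and O(1) space in the majority case is the Boyer-Moore majority vote algorithm. The answer does not name a specific technique, so treat this as one possible reading of it; majority_element is a name chosen for this sketch.

```cpp
#include <vector>

// Returns the majority element, assuming one exists (some value occupies
// more than half of a non-empty input). Without that guarantee the result
// is only a candidate and needs a second counting pass to verify.
int majority_element(const std::vector<int>& numbers) {
    int candidate = 0;
    int balance = 0;    // votes for the candidate minus votes against it
    for (int x : numbers) {
        if (balance == 0)
            candidate = x;              // pick a fresh candidate
        balance += (x == candidate) ? 1 : -1;
    }
    return candidate;
}
```

Note that the row in the question (1 appears 5 times out of 12) has no majority element, so this shortcut would not apply to it.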
Generic C++ solution:
#include <algorithm>
#include <iterator>
#include <map>
#include <utility>
template<class T, class U>
struct less_second
{
bool operator()(const std::pair<T, U>& x, const std::pair<T, U>& y)
{
return x.second < y.second;
}
};
template<class Iterator>
std::pair<typename std::iterator_traits<Iterator>::value_type, int>
most_frequent(Iterator begin, Iterator end)
{
typedef typename std::iterator_traits<Iterator>::value_type vt;
std::map<vt, int> frequency;
for (; begin != end; ++begin) ++frequency[*begin];
return *std::max_element(frequency.begin(), frequency.end(),
less_second<vt, int>());
}
#include <iostream>
int main()
{
int array[] = {1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2};
std::pair<int, int> result = most_frequent(array, array + 12);
std::cout << result.first << " appears " << result.second << " times.\n";
}
Haskell solution:
import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Function (on)
count = foldl step Map.empty where
step frequency x = Map.alter next x frequency
next Nothing = Just 1
next (Just n) = Just (n+1)
most_frequent = maximumBy (compare `on` snd) . Map.toList . count
example = most_frequent [1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2]
Shorter Haskell solution, with help from stack overflow:
import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Function (on)
most_frequent = maximumBy (compare `on` snd) . Map.toList .
Map.fromListWith (+) . flip zip (repeat 1)
example = most_frequent [1, 1, 5, 1, 3, 7, 2, 1, 8, 9, 1, 2]
The solution below gives you the count of each number. It is a better approach than using map in terms of time and space. If you need to get the number that appeared most number of times, then this is not better than previous ones.
EDIT: This approach is useful for unsigned numbers only and the numbers starting from 1.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>
int main()
{
    const std::string row = "1,1,5,1,3,7,2,1,8,9,1,2";
    // arr[d - 1] counts the digit d; only the single digits 1..9 are handled
    std::vector<int> arr(9, 0);
    for (char c : row)
    {
        if (c >= '1' && c <= '9')
            arr[c - '1']++;
    }
    for (std::size_t i = 0; i < arr.size(); i++)
        std::cout << i + 1 << "-->" << arr[i] << std::endl;
}
Since this is homework I think it's OK to supply a solution in a different language.
In Smalltalk something like the following would be a good starting point:
SequenceableCollection>>mode
| aBag maxCount mode |
aBag := Bag new
addAll: self;
yourself.
aBag valuesAndCountsDo: [ :val :count |
(maxCount isNil or: [ count > maxCount ])
ifTrue: [ mode := val.
maxCount := count ]].
^mode
As time is going by, the language evolves.
We have now many more language constructs that make life simpler
namespace aliases
CTAD (Class Template Argument Deduction)
more modern containers like std::unordered_map
range based for loops
the std::ranges library
projections
using statement
structured bindings
more modern algorithms
We could now come up with the following code:
#include <iostream>
#include <vector>
#include <unordered_map>
#include <algorithm>
namespace rng = std::ranges;
int main() {
// Demo data
std::vector data{ 2, 456, 34, 3456, 2, 435, 2, 456, 2 };
// Count values
using Counter = std::unordered_map<decltype (data)::value_type, std::size_t> ;
Counter counter{}; for (const auto& d : data) counter[d]++;
// Get max
const auto& [value, count] = *rng::max_element(counter, {}, &Counter::value_type::second);
// Show output
std::cout << '\n' << value << " found " << count << " times\n";
}

Find largest and second largest element in a range

How do I find the above without removing the largest element and searching again? Is there a more efficient way to do this? It does not matter if these elements are duplicates.
for (e: all elements) {
if (e > largest) {
second = largest;
largest = e;
} else if (e > second) {
second = e;
}
}
You could either initialize largest and second to an appropriate lower bound, or to the first two items in the list (check which one is bigger, and don't forget to check if the list has at least two items)
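A sketch of the second option, initializing from the first two items and checking that the list has at least two of them. It assumes a std::vector<int> and C++17 (for std::optional); two_largest is a name made up for the example.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Returns {largest, second largest}, or std::nullopt for fewer than two
// elements. Duplicates count separately, so {5, 5, 3} yields {5, 5}.
std::optional<std::pair<int, int>> two_largest(const std::vector<int>& v) {
    if (v.size() < 2)
        return std::nullopt;
    int largest = v[0];
    int second = v[1];
    if (second > largest)
        std::swap(largest, second);     // order the first two items
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] > largest) {
            second = largest;
            largest = v[i];
        } else if (v[i] > second) {
            second = v[i];
        }
    }
    return std::make_pair(largest, second);
}
```

Seeding from the data avoids having to pick a sentinel lower bound, which matters when the values can be negative.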
using partial_sort ?
std::partial_sort(aTest.begin(), aTest.begin() + 2, aTest.end(), Functor);
An Example:
std::vector<int> aTest;
aTest.push_back(3);
aTest.push_back(2);
aTest.push_back(4);
aTest.push_back(1);
std::partial_sort(aTest.begin(), aTest.begin()+2,aTest.end(), std::greater<int>());
int Max = aTest[0];
int SecMax = aTest[1];
nth_element(begin, begin+n,end,Compare) places the element that would be nth (where "first" is "0th") if the range [begin, end) were sorted at position begin+n and makes sure that everything from [begin,begin+n) would appear before the nth element in the sorted list. So the code you want is:
nth_element(container.begin(),
container.begin()+1,
container.end(),
appropriateCompare);
This will work well in your case, since you're only looking for the two largest. Assuming your appropriateCompare sorts things from largest to smallest, the second largest element will be at position 1 and the largest will be at position 0.
Let's assume you mean to find the two largest unique values in the list.
If the list is already sorted, then just look at the second last element (or rather, iterate from the end looking for the second last value).
If the list is unsorted, then don't bother to sort it. Sorting is at best O(n lg n). Simple linear iteration is O(n), so just loop over the elements keeping track:
v::value_type second_best = 0, best = 0;
for(v::const_iterator i=v.begin(); i!=v.end(); ++i)
if(*i > best) {
second_best = best;
best = *i;
} else if(*i > second_best) {
second_best = *i;
}
There are of course other criteria, and these could all be put into the test inside the loop. However, should you mean that two elements that both have the same largest value should be found, you have to consider what happens should three or more elements all have this largest value, or if two or more elements have the second largest.
The optimal algorithm shouldn't need more than 1.5 * N - 2 comparisons. (Once we've decided that it's O(n), what's the coefficient in front of N? 2 * N comparisons is less than optimal).
So, first determine the "winner" and the "loser" in each pair - that's 0.5 * N comparisons.
Then determine the largest element by comparing winners - that's another 0.5 * N - 1 comparisons.
Then determine the second-largest element by comparing the loser of the pair where the largest element came from against the winners of all other pairs - another 0.5 * N - 1 comparisons.
Total comparisons = 1.5 N - 2.
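The three steps above could be sketched like this, with a counter to check the comparison total. The sketch assumes an even number of elements (at least 2) and uses O(N) scratch space for clarity; two_largest_pairwise and the comparisons parameter are hypothetical names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns {largest, second largest} and adds the number of element
// comparisons performed to `comparisons`.
std::pair<int, int> two_largest_pairwise(const std::vector<int>& v,
                                         std::size_t& comparisons) {
    const std::size_t pairs = v.size() / 2;
    std::vector<int> winners(pairs), losers(pairs);
    // Step 1: winner and loser of each pair, N/2 comparisons.
    for (std::size_t p = 0; p < pairs; ++p) {
        const int a = v[2 * p];
        const int b = v[2 * p + 1];
        ++comparisons;
        if (a < b) { winners[p] = b; losers[p] = a; }
        else       { winners[p] = a; losers[p] = b; }
    }
    // Step 2: largest among the winners, N/2 - 1 comparisons.
    std::size_t best = 0;
    for (std::size_t p = 1; p < pairs; ++p) {
        ++comparisons;
        if (winners[best] < winners[p])
            best = p;
    }
    // Step 3: loser of the best pair against all other winners,
    // another N/2 - 1 comparisons.
    int second = losers[best];
    for (std::size_t p = 0; p < pairs; ++p) {
        if (p == best)
            continue;
        ++comparisons;
        if (second < winners[p])
            second = winners[p];
    }
    return {winners[best], second};
}
```

For the six elements {3, 9, 4, 7, 1, 8} this performs 3 + 2 + 2 = 7 comparisons, matching 1.5 * 6 - 2.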
The answer depends if you just want the values, or also iterators pointing at the values.
Minor modification of @will's answer.
v::value_type second_best = 0, best = 0;
for(v::const_iterator i=v.begin(); i!=v.end(); ++i)
{
if(*i > best)
{
second_best = best;
best = *i;
}
else if (*i > second_best)
{
second_best = *i;
}
}
Create a sublist from n..m, sort it descending. Then grab the first two elements. Delete these elements from the original list.
You can scan the list in one pass and save the 1st and 2nd values, that has a O(n) efficiency while sorting is O(n log n).
EDIT:
I think that partial sort is O(n log k)
Untested but fun:
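#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <list>
template <typename T, std::size_t n>
class top_n_functor
{
public:
    void operator() (const T& x) {
        // values_ is kept sorted ascending; f is the first element not less than x
        auto f = std::lower_bound(values_.begin(), values_.end(), x);
        if(values_.size() < n) {
            values_.insert(f, x);
            return;
        }
        if(values_.begin() == f)
            return; // x is no better than the current worst
        // move the worst node to just before f and overwrite it with x
        auto removed = values_.begin();
        values_.splice(removed, values_, std::next(removed), f);
        *removed = x;
    }
    std::list<T> values() const {
        return values_;
    }
private:
    std::list<T> values_;
};
int main()
{
    int A[] = {1, 4, 2, 8, 5, 7};
    const int N = sizeof(A) / sizeof(int);
    auto vals = std::for_each(A, A + N, top_n_functor<int,2>()).values();
    // the list is ascending, so the largest value is at the back
    std::cout << "The top is " << vals.back()
        << " with second place being " << *std::next(vals.rbegin()) << std::endl;
}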
If the largest is the first element, search for the second largest in [largest+1,end). Otherwise search in [begin,largest) and [largest+1,end) and take the maximum of the two. Of course, this has O(2n), so it's not optimal.
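The two-pass idea just described might look like this with std::max_element. It assumes a std::vector<int> with at least two elements; two_largest_two_pass is a name invented for the sketch.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Returns {largest, second largest}; requires v.size() >= 2.
std::pair<int, int> two_largest_two_pass(const std::vector<int>& v) {
    auto largest = std::max_element(v.begin(), v.end());
    // Best element to the right of the largest (equals end() if none).
    auto right = std::max_element(std::next(largest), v.end());
    if (largest == v.begin())
        return {*largest, *right};
    // Best element to the left of the largest.
    auto left = std::max_element(v.begin(), largest);
    if (right == v.end())
        return {*largest, *left};
    return {*largest, std::max(*left, *right)};
}
```

This makes roughly 2n comparisons in total, which is the cost the answer refers to.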
If you have random-access iterators, you could do as quick sort does and use the ever-elegant recursion:
template< typename T >
std::pair<T,T> find_two_largest(const std::pair<T,T>& lhs, const std::pair<T,T>& rhs)
{
// implementation finding the two largest of the four values left as an exercise :)
}
template< typename RAIter >
std::pair< typename std::iterator_traits<RAIter>::value_type
, typename std::iterator_traits<RAIter>::value_type >
find_two_largest(RAIter begin, RAIter end)
{
const std::ptrdiff_t diff = end-begin;
if( diff < 2 )
return std::make_pair(*begin, *begin);
if( diff < 3 )
return std::make_pair(*begin, *(begin+1));
const RAIter middle = begin + (diff)/2;
typedef std::pair< typename std::iterator_traits<RAIter>::value_type
, typename std::iterator_traits<RAIter>::value_type >
result_t;
const result_t left = find_two_largest(begin,middle);
const result_t right = find_two_largest(middle,end);
return find_two_largest(left,right);
}
This has O(n) and shouldn't make more comparisons than NomeN's implementation.
top k is usually a bit better than n(log k)
#include <cstddef>
#include <memory>
#include <set>
// special_allocator defaults to std::allocator so the class compiles as is
template <class t,class ordering,class special_allocator = std::allocator<t> >
class TopK {
public:
    typedef std::multiset<t,ordering,special_allocator> BEST_t;
    BEST_t best;
    const std::size_t K;
    TopK(const std::size_t k)
        : K(k){
    }
    const BEST_t& insert(const t& item){
        if(best.size()<K){
            best.insert(item);
            return best;
        }
        //k items in multiset now
        //and here is why its better - because if the distribution is random then
        //this and comparison above are usually the comparisons that is done;
        if(best.key_comp()(*best.begin(),item)){//item better than worst
            best.erase(best.begin());//the worst
            best.insert(item); //log k-1 average as only k-1 items in best
        }
        return best;
    }
    template <class it>
    const BEST_t& insert(it i,const it last){
        for(;i!=last;++i){
            insert(*i);
        }
        return best;
    }
};
Of course the special_allocator can in essence be just an array of k multiset value_types and a free list of those nodes (which typically has nothing on it, as all k nodes are in use in the multiset until it's time to put a new item in; then we erase one and immediately reuse it). Good to have this, or else the memory alloc/free in std::multiset and the cache line crap kills ya. It's a (very) tiny bit of work to give it static state without violating STL allocator rules.
Not as good as a specialized algo for exactly 2, but for fixed k<<n I would GUESS (2n+delta*n) comparisons where delta is small - my copy of Knuth's The Art of Computer Programming vol. 3 (Sorting and Searching) is packed away, and an estimate of delta is a bit more work than I want to do.
Worst case is, I would guess, n(log(k-1) + 2), when the input is in opposite order and all distinct.
Best case is 2n + k(log k), when the k best come first.
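A stripped-down, self-contained sketch of the same idea with the default allocator instead of a special one. The free function top_k is a made-up name, and it only shows the "check the size, then compare against the worst" path that the estimates above are about.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Hypothetical helper: returns the k largest items (by Less) from `items`.
// *best.begin() is always the worst value currently kept.
template <typename T, typename Less = std::less<T>>
std::multiset<T, Less> top_k(const std::vector<T>& items, std::size_t k) {
    std::multiset<T, Less> best;
    if (k == 0) return best;
    for (const T& item : items) {
        if (best.size() < k) {
            best.insert(item);                     // still filling up
        } else if (best.key_comp()(*best.begin(), item)) {
            best.erase(best.begin());              // drop the worst
            best.insert(item);                     // O(log k) insert
        }                                          // else: rejected after one comparison
    }
    return best;
}
```

For random input most items are rejected by the single comparison against *best.begin(), which is where the roughly 2n term comes from.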
I think you could implement a custom array and overload the indexed get/set methods of its elements. Then on every set call, compare the new value with two fields that hold the result. While this makes the setter slower, it benefits from caching or even registers. Then it's a no-op to get the result. This must be faster if you populate the array only once per search for the maximums. But if the array is modified frequently, then it is slower.
If the array is used in vectorized loops, then it gets harder to implement, as you have to use AVX/SSE-optimized max methods inside the setter.
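A rough sketch of that idea, simplified to an append-only container so the tracked values can never go stale. The class name TrackedInts and its members are made up for illustration. Overwriting existing elements, as an indexed setter would, needs extra care because replacing the current maximum with a smaller value forces a rescan.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical append-only wrapper that tracks the two largest values
// as elements are added.
class TrackedInts {
public:
    void push_back(int value) {
        data_.push_back(value);
        if (data_.size() == 1) {
            largest_ = value;
        } else if (largest_ < value) {
            second_ = largest_;        // old maximum becomes the runner-up
            largest_ = value;
        } else if (data_.size() == 2 || second_ < value) {
            second_ = value;
        }
    }
    int operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }
    int largest() const { return largest_; }         // meaningful once size() >= 1
    int second_largest() const { return second_; }   // meaningful once size() >= 2
private:
    std::vector<int> data_;
    int largest_ = 0;
    int second_ = 0;
};
```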

Finding gaps in sequence of numbers

I have a std::vector containing a handful of numbers, which are not in any particular order, and may or may not have gaps between the numbers - for example, I may have { 1,2,3, 6 } or { 2,8,4,6 } or { 1, 9, 5, 2 }, etc.
I'd like a simple way to look at this vector and say 'give me the lowest number >= 1 which does not appear in the vector'. So,
for the three examples above, the answers would be 4, 1 and 3 respectively.
It's not performance critical, and the list is short so there aren't any issues about copying the list and sorting it, for example.
I am not really stuck for a way to do this, but my STL skills are seriously atrophied and I can feel that I'm about to do something inelegant - I would be interested to see what other people came up with.
The standard algorithm you are looking for is std::adjacent_find.
Here is a solution that also uses a lambda to make the predicate clean:
int first_gap( std::vector<int> vec )
{
// Handle the special case of an empty vector. Return 1.
if( vec.empty() )
return 1;
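// Sort the vector
std::sort( vec.begin(), vec.end() );
// If the smallest value is above 1, then 1 itself is the lowest missing number.
if( vec.front() > 1 )
return 1;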
// Find the first adjacent pair that differ by more than 1.
auto i = std::adjacent_find( vec.begin(), vec.end(), [](int l, int r){return l+1<r;} );
// Handle the special case of no gaps. Return the last value + 1.
if ( i == vec.end() )
--i;
return 1 + *i;
}
The checked answer uses < for comparison. != is much simpler:
int find_gap(std::vector<int> vec) {
std::sort(vec.begin(), vec.end());
int next = 1;
for (std::vector<int>::iterator it = vec.begin(); it != vec.end(); ++it) {
if (*it != next) return next;
++next;
}
return next;
}
find_gap({1,2,4,5}) == 3
find_gap({2}) == 1
find_gap({1,2,3}) == 4
I'm not passing a reference to the vector since a) he said time doesn't matter and b) so I don't change the order of the original vector.
Sorting the list and then doing a linear search seems the simplest solution. Depending on the expected composition of the lists you could use a less general-purpose sorting algorithm, and if you implement the sort yourself you could keep track of data during the sort that could be used to speed up (or eliminate entirely) the search step. I do not think there is any particularly elegant solution to this problem.
You could allocate a bit vector (of the same length N as the input vector), initialize it to zero, then mark every value between 1 and N that occurs (note that numbers larger than the length can be ignored). Then, return the first unmarked value (or N+1 if all are marked, which only happens if each of 1..N occurs exactly once in the input vector).
This should be asymptotically faster than sort and search: O(N) instead of O(N log N). It will use more memory than sorting if you are allowed to destroy the original, but less memory than sorting if you must preserve the original.
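A minimal sketch of the bit-vector approach, using std::vector<bool> as the bit vector. The function name first_missing_positive is made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: lowest number >= 1 that does not appear in vec.
int first_missing_positive(const std::vector<int>& vec) {
    const std::size_t n = vec.size();
    // seen[i] records whether the value i + 1 occurs in vec.
    std::vector<bool> seen(n, false);
    for (int x : vec) {
        // The answer always lies in [1, n + 1], so values outside [1, n]
        // cannot affect it and are skipped.
        if (x >= 1 && static_cast<std::size_t>(x) <= n)
            seen[static_cast<std::size_t>(x) - 1] = true;
    }
    for (std::size_t i = 0; i < n; ++i)
        if (!seen[i]) return static_cast<int>(i) + 1;
    return static_cast<int>(n) + 1;   // vec holds every value in 1..n
}
```

With the three vectors from the question it gives 4, 1 and 3, and it leaves the input untouched.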
Actually, if you do a bubble sort (you know... the one that they teach you first and then tell you to never use again...), you will be able to spot the first gap early in the sorting process, so you can stop there. That should give you the fastest overall time.
Sort-n-search:
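std::sort(vec.begin(), vec.end());
int lowest = 1;
/* lowest stays 1 unless the sorted vector starts at 1 (assumes all values are >= 1) */
if (!vec.empty() && vec[0] == 1)
{
/* 1, 2, ..., N case, unless a gap is found below */
lowest = vec.back() + 1;
for(size_t ii = 1; ii < vec.size(); ++ii)
{
if (vec[ii - 1] + 1 < vec[ii])
{
lowest = (vec[ii - 1] + 1);
break;
}
}
}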
Iterators could be used with just as clear intent as showcased in @joe_mucchiello's (ed: better) answer.
OK, here's my 2 cents. Assume you've got a vector of length N.
If N<=2 you can check directly
First, use min_element to get the smallest element, remember it as emin
Call nth_element to get the element at N/2, call it ehalf
If ehalf != emin+N/2 there's a gap to the left, apply this method recursively there by calling nth_element on the whole array but asking for element N/4. Otherwise, recurse on the right asking for element 3*N/4.
This should be slightly better than sorting completely up front.
you could go with something like....
struct InSequence
{
int _current; bool insequence;
InSequence() : _current(1), insequence(true){}
bool operator()(int x) {
insequence = insequence ? (x == _current) : false;
_current++;
return insequence;
}
};
int first_not_in_sequence(std::vector<int>& v)
{
std::sort(v.begin(), v.end());
return 1+std::count_if(v.begin(), v.end(),InSequence());
}
A possible implementation of Thomas Kammeyer's answer
I found Thomas' approach really smart and useful - since some of us dream in code and I find the actual implementation a bit tricky I wanted to provide some ready-to-use code.
The solution presented here is as generic as possible:
No assumption is made on the type of container or range except their iterators must meet the requirements of ValueSwappable and RandomAccessIterator (due to partial sorting with nth_element)
Any number type can be used - the required traits are outlined below
Another improvement I think is that a no-gap condition can be checked early: since we have to scan for the minimum anyway we can also scan for the maximum at the same time and then determine whether the number range even contains a gap worth finding.
Last but not least the same recursive approach can be adapted for sorted ranges! If you encode in a template value parameter whether the range is already sorted, you can simply skip the partial sorting plus make determining minimum/maximum elements a no-op.
#include <type_traits>
#include <iterator>
#include <tuple>
#include <utility>
#include <algorithm>
#include <cstddef>
#include <stdexcept>
// number type must be:
// * arithmetic
// * subtractable (a - b)
// * divisible by 2 (a / 2)
// * incrementable (++a)
// * less-than-comparable (a < b)
// * default-constructible (A{})
// * copy-constructible
// * value-constructible (A(n))
// * unsigned or number range must only contain values >0
// Forward declarations: these templates are called before they are defined below.
template<typename T>
T increment(T v);
template<typename Iterator>
typename std::iterator_traits<Iterator>::value_type
lowest_gap_value_unsorted(Iterator first, Iterator last, std::size_t N);
template<typename Iterator>
typename std::iterator_traits<Iterator>::value_type
lowest_gap_value_unsorted_recursive(Iterator first, Iterator last, std::size_t N, typename std::iterator_traits<Iterator>::value_type minValue);
/** Find lowest gap value in a range */
template<typename Range>
typename std::remove_reference_t<Range>::value_type
lowest_gap_value_unsorted(Range&& r)
{
static_assert(!std::is_lvalue_reference_v<Range> && !std::is_const_v<Range>, "lowest_gap_value_unsorted requires a modifiable copy of the passed range");
return lowest_gap_value_unsorted(std::begin(r), std::end(r), std::size(r));
}
/** Find lowest gap value in a range with specified size */
template<typename Range>
typename std::remove_reference_t<Range>::value_type
lowest_gap_value_unsorted(Range&& r, std::size_t N)
{
static_assert(!std::is_lvalue_reference_v<Range> && !std::is_const_v<Range>, "lowest_gap_value_unsorted requires a modifiable copy of the passed range");
return lowest_gap_value_unsorted(std::begin(r), std::end(r), N);
}
/** Find lowest gap value in an iterator range */
template<typename Iterator>
typename std::iterator_traits<Iterator>::value_type
lowest_gap_value_unsorted(Iterator first, Iterator last)
{
return lowest_gap_value_unsorted(first, last, std::distance(first, last));
}
/** Find lowest gap value in an iterator range with specified size */
template<typename Iterator>
typename std::iterator_traits<Iterator>::value_type
lowest_gap_value_unsorted(Iterator first, Iterator last, std::size_t N)
{
typedef typename std::iterator_traits<Iterator>::value_type Number;
if (bool empty = last == first)
return increment(Number{});
Iterator minElem, maxElem;
std::tie(minElem, maxElem) = std::minmax_element(first, last);
if (bool contains0 = !(Number{} < *minElem))
throw std::logic_error("Number range must not contain 0");
if (bool missing1st = increment(Number{}) < *minElem)
return increment(Number{});
if (bool containsNoGap = !(Number(N) < increment(*maxElem - *minElem)))
return increment(*maxElem);
return lowest_gap_value_unsorted_recursive(first, last, N, *minElem);
}
template<typename Iterator>
typename std::iterator_traits<Iterator>::value_type
lowest_gap_value_unsorted_recursive(Iterator first, Iterator last, std::size_t N, typename std::iterator_traits<Iterator>::value_type minValue)
{
typedef typename std::iterator_traits<Iterator>::value_type Number;
if (N == 1)
return ++minValue;
if (N == 2)
{
// determine greater of the 2 remaining elements
Number maxValue = !(minValue < *first) ? *std::next(first) : *first;
if (bool gap = ++minValue < maxValue)
return minValue;
else
return ++maxValue;
}
Iterator medianElem = std::next(first, N / 2);
// sort partially
std::nth_element(first, medianElem, last);
if (bool gapInLowerHalf = (Number(N) / 2 < *medianElem - minValue))
return lowest_gap_value_unsorted_recursive(first, medianElem, N / 2, minValue);
else
return lowest_gap_value_unsorted_recursive(medianElem, last, N / 2 + N % 2, *medianElem);
}
template<typename T>
T increment(T v)
{
return ++v;
}