Position of median within a list - c++

I have an unsorted array and I need the position of the median. I know there are several algorithms to calculate the median of a given array in O(n), but all of them include some kind of reordering of the array, like in median of medians and random selection.
I'm not interested in the median itself; I only need its position within the array.
Is there any way I can do this in O(n)? Keeping track of all the swaps will create a massive overhead, so I'm looking for another solution.

Let's say you have an array of data, and you would like to find its median:
double data[MAX_DATA] = ...
Create an array of indexes, and initialize each index to its own position, like this:
int index[MAX_DATA];
for (int i = 0 ; i != MAX_DATA ; i++) {
index[i] = i;
}
Now implement the linear median algorithm with the following changes:
When the original algorithm compares data[i] to data[j], replace with a comparison of data[index[i]] to data[index[j]]
When the original algorithm swaps data[i] and data[j], swap index[i] and index[j] instead.
Since the elements of data remain in their place all the time, the modified algorithm will produce the position of the median in the unmodified array, rather than its position in the array with some elements moved to different spots.
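As a minimal sketch of the modification described above, here is a plain quickselect in which every comparison goes through index and only index is ever swapped. It assumes a middle-element pivot, which gives expected O(n) time but is quadratic in the worst case; a guaranteed-linear version would need median-of-medians pivot selection. The function name position_of_rank is made up for illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: returns the position in `data` of the element that
// would sit at sorted rank k (0-based). `data` itself is never modified.
// Precondition: k < data.size().
std::size_t position_of_rank(const std::vector<double>& data, std::size_t k)
{
    std::vector<std::size_t> index(data.size());
    std::iota(index.begin(), index.end(), std::size_t{0}); // index[i] = i

    std::size_t lo = 0, hi = index.size() - 1;
    while (lo < hi) {
        // Lomuto partition around the value referred to by the middle index.
        std::swap(index[lo + (hi - lo) / 2], index[hi]);
        const double pivot = data[index[hi]];
        std::size_t store = lo;
        for (std::size_t i = lo; i < hi; ++i) {
            if (data[index[i]] < pivot) {           // compare through the index
                std::swap(index[i], index[store]);  // swap indexes, not data
                ++store;
            }
        }
        std::swap(index[store], index[hi]);         // pivot lands at its final rank

        if (k == store) break;
        if (k < store) hi = store - 1;              // continue in the left part
        else lo = store + 1;                        // continue in the right part
    }
    return index[k];
}
```

Calling position_of_rank(data, data.size() / 2) yields the position of the (upper) median in the untouched array.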
In C++ you can implement this with pointers instead of indexes, and use std::nth_element on the container of pointers, like this:
vector<int> data = {1, 5, 2, 20, 10, 7, 9, 1000};
vector<const int*> ptr(data.size());
transform(data.begin(), data.end(), ptr.begin(), [](const int& d) {return &d;});
auto mid = next(ptr.begin(), data.size() / 2);
nth_element(ptr.begin(), mid, ptr.end(), [](const int* lhs, const int* rhs) {return *lhs < *rhs;});
ptrdiff_t pos = *mid - &data[0];
cout << pos << endl << data[pos] << endl;

Here's a working example that generates a secondary array of indices and finds the median of the input array through std::nth_element and an indirect comparison:
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>
#include <iterator>
int main()
{
// input data, big and expensive to sort or copy
std::string big_data[] = { "hello", "world", "I", "need", "to", "get", "the", "median", "index" };
auto const N = std::distance(std::begin(big_data), std::end(big_data));
auto const M = (N - 1) / 2; // 9 elements, median is at index 4 (the 5th element) of the sorted array
// generate indices
std::vector<int> indices;
auto value = 0;
std::generate_n(std::back_inserter(indices), N, [&](){ return value++; });
// find median of input array through indirect comparison and sorting
std::nth_element(indices.begin(), indices.begin() + M, indices.end(), [&](int lhs, int rhs){
return big_data[lhs] < big_data[rhs];
});
std::cout << indices[M] << ":" << big_data[indices[M]] << "\n";
// check, sort input array and confirm it has the same median
std::sort(std::begin(big_data), std::end(big_data));
std::cout << M << ":" << big_data[M] << "\n";
}
This algorithm has O(N) complexity on average, since it is the sum of std::generate_n and std::nth_element: the former is O(N) and the latter is O(N) on average in the size of its input.

There is an O(n log n) algorithm for keeping track of the median of an unbounded stream of numbers. (As you don't want to alter the list, you can just as well treat it as a stream.) The algorithm uses two heaps: a max-heap whose top is the maximum of the lower half, and a min-heap whose top is the minimum of the upper half. The algorithm is explained here: http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/. You can use the same code with minimal customization.
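A rough sketch of how the two-heap idea could be adapted to report a position rather than a value: each heap stores (value, position) pairs, so the top of the lower heap carries the original index. This is not the code from the linked article; the function name median_position_two_heaps is hypothetical, and the sketch returns the lower median when the element count is even.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: position of the lower median of `data`, found by
// streaming (value, position) pairs through two heaps. O(n log n) time.
// Precondition: data is not empty.
std::size_t median_position_two_heaps(const std::vector<int>& data)
{
    using Item = std::pair<int, std::size_t>;  // (value, original position)
    std::priority_queue<Item> lower;           // max-heap holding the smaller half
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> upper; // min-heap holding the larger half

    for (std::size_t i = 0; i < data.size(); ++i) {
        Item item{data[i], i};
        if (lower.empty() || item <= lower.top()) lower.push(item);
        else upper.push(item);

        // Rebalance so that lower holds as many items as upper, or one more.
        if (lower.size() > upper.size() + 1) {
            upper.push(lower.top());
            lower.pop();
        } else if (upper.size() > lower.size()) {
            lower.push(upper.top());
            upper.pop();
        }
    }
    return lower.top().second; // top of the smaller half is the lower median
}
```

The input is only read, never reordered, at the cost of O(n) extra memory for the heaps.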

Related

An efficient algorithm to sample non-duplicate random elements from an array

I'm looking for an algorithm to pick M random elements from a given array. The prerequisites are:
the sampled elements must be unique,
the array to sample from may contain duplicates,
the array to sample from is not necessarily sorted.
This is what I've managed to come up with. Here I'm also assuming that the number of unique elements in the array is greater than or equal to M.
#include <random>
#include <vector>
#include <algorithm>
#include <iostream>
const std::vector<int> sample(const std::vector<int>& input, size_t n) {
std::random_device rd;
std::mt19937 engine(rd());
std::uniform_int_distribution<int> dist(0, input.size() - 1);
std::vector<int> result;
result.reserve(n);
size_t id;
do {
id = dist(engine);
if (std::find(result.begin(), result.end(), input[id]) == result.end())
result.push_back(input[id]);
} while (result.size() < n);
return result;
}
int main() {
std::vector<int> input{0, 0, 1, 1, 2, 2, 3, 3, 4, 4};
std::vector<int> result = sample(input, 3);
for (const auto& item : result)
std::cout << item << ' ';
std::cout << std::endl;
}
This algorithm does not seem to be the best. Is there a more efficient algorithm (with lower time complexity) to solve this task? It would be good if this algorithm could also check that the number of unique elements in the input array is not less than M (or pick as many unique elements as possible if this is not the case).
Possible solution
As MSalters suggested, I use std::unordered_set to remove duplicates and std::shuffle to shuffle elements in a vector constructed from the set. Then I resize the vector and return it.
const std::vector<int> sample(const std::vector<int>& input, size_t M) {
std::unordered_set<int> rem_dups(input.begin(), input.end());
if (rem_dups.size() < M) M = rem_dups.size();
std::vector<int> result(rem_dups.begin(), rem_dups.end());
std::mt19937 g(std::random_device{}());
std::shuffle(result.begin(), result.end(), g);
result.resize(M);
return result;
}
The comments already note the use of std::set. The additional request to check for M unique elements in the input makes that a bit more complicated. Here's an alternative implementation:
Put all inputs in a std::set or std::unordered_set. This removes duplicates.
Copy all elements to the return vector
If that has more than M elements, std::shuffle it and resize it to M elements.
Return it.
Use a set S to store the output, initially empty.
i = 0
while |S| < M && i <= n-1
swap the i'th element of the input with a randomly chosen element at position i or later
add the newly swapped i'th element to your set if it isn't already there
i++
This will end with S having M distinct elements from your input array (if there are M distinct elements). However, elements which are more common in the input array are more likely to be in S (unless you go through the additional work of eliminating duplicates from the input first).
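The pseudocode above could be sketched in C++ roughly as follows. The function name sample_distinct and the choice to pass the random engine in as a parameter are assumptions made for illustration; the input is copied because the partial shuffle reorders it.

```cpp
#include <cstddef>
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper: draws up to m distinct values from `input` by running a
// Fisher-Yates shuffle one step at a time and stopping once m distinct values
// have been collected. `input` is taken by value because it gets reordered.
std::vector<int> sample_distinct(std::vector<int> input, std::size_t m, std::mt19937& rng)
{
    std::unordered_set<int> seen;
    std::vector<int> result;
    for (std::size_t i = 0; i < input.size() && result.size() < m; ++i) {
        // Pick a random position in [i, n-1] and swap it into slot i.
        std::uniform_int_distribution<std::size_t> pick(i, input.size() - 1);
        std::swap(input[i], input[pick(rng)]);
        // Keep the value only if it has not been selected before.
        if (seen.insert(input[i]).second) result.push_back(input[i]);
    }
    return result;
}
```

If the input has fewer than m distinct values, the loop simply runs to the end and returns all of them.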

Find equal values in an array in C++

Is there a faster way to find equal values in an array than comparing each element one by one with all of the array's elements?
for(int i = 0; i < arrayLenght; i ++)
{
for(int k = i + 1; k < arrayLenght; k ++)
{
if(array[i] == array[k])
{
sprintf(message,"There is a duplicate of %s",array[i]);
ShowMessage(message);
break;
}
}
}
Since sorting your container is a possible solution, std::unique is the simplest solution to your problem:
std::vector<int> v {0,1,0,1,2,0,1,2,3};
std::sort(begin(v), end(v));
v.erase(std::unique(begin(v), end(v)), end(v));
First, the vector is sorted. You can use anything; std::sort is just the simplest. After that, std::unique moves the unique elements to the front of the container and returns an iterator to the new logical end; the elements past that iterator are left in an unspecified state. That iterator is then passed to erase, which removes the leftover tail from the vector.
You could use std::multiset and then count duplicates afterwards like this:
#include <iostream>
#include <set>
int main()
{
const int arrayLenght = 14;
int array[arrayLenght] = { 0,2,1,3,1,4,5,5,5,2,2,3,5,5 };
std::multiset<int> ms(array, array + arrayLenght);
for (auto it = ms.begin(), end = ms.end(); it != end; it = ms.equal_range(*it).second)
{
int cnt = 0;
if ((cnt = ms.count(*it)) > 1)
std::cout << "There are " << cnt << " of " << *it << std::endl;
}
}
https://ideone.com/6ktW89
There are 2 of 1
There are 3 of 2
There are 2 of 3
There are 5 of 5
If the value_type of this array can be ordered by operator< (a strict weak order), it's a good choice to do as YSC answered.
If not, maybe you can define a hash function that hashes the objects to different values. Then you can do this in O(n) time complexity, like:
struct ValueHash
{
size_t operator()(const Value& rhs) const{
//do_something
}
};
struct ValueCmp
{
bool operator()(const Value& lhs, const Value& rhs) const{
//do_something
}
};
unordered_set<Value,ValueHash,ValueCmp> myset;
for(int i = 0; i < arrayLenght; i ++)
{
if(myset.find(array[i])==myset.end())
myset.insert(array[i]);
else
dosomething();
}
In case you have a large amount of data, you can first sort the array (quick sort gives you a first pass in O(n*log(n))) and then do a second pass by comparing each value with the next (as they might be all together) to find duplicates (this is a sequential pass in O(n)) so, sorting in a first pass and searching the sorted array for duplicates gives you O(n*log(n) + n), or finally O(n*log(n)).
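A minimal sketch of the sort-then-scan approach just described, assuming int values and a made-up function name find_duplicates. It works on a copy so the caller's array keeps its order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns each value that occurs more than once,
// in ascending order. Takes the input by value so the original is untouched.
std::vector<int> find_duplicates(std::vector<int> values)
{
    std::sort(values.begin(), values.end());            // first pass: O(n log n)
    std::vector<int> duplicates;
    for (std::size_t i = 1; i < values.size(); ++i) {   // second pass: O(n)
        // Equal values are adjacent after sorting; report each one only once.
        if (values[i] == values[i - 1] &&
            (duplicates.empty() || duplicates.back() != values[i])) {
            duplicates.push_back(values[i]);
        }
    }
    return duplicates;
}
```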
EDIT
An alternative has been suggested in the comments: using a std::set to check for already processed data. The algorithm just goes element by element, checking if the element has been seen before. This can lead to an O(n) algorithm, but only if you take care to use a hash set. If you use a sorted set, you incur O(log(n)) for each set search and end up with the same O(n*log(n)). But because the proposal can be implemented with a hash set (you have to be careful to select std::unordered_set, so you don't get the extra access time per search), you get a final O(n). Of course, you have to account for possible automatic growth of the hash table or a large amount of memory being used by it.
Thanks to @freakish, who pointed out the set solution in the comments to the question.

How to find the minimal missing integer in a list in an STL way

I want to find the minimal missing positive integer in a given list. That is, given a list of positive integers (i.e. larger than 0, possibly with duplicates), how do I find the smallest of those that are missing?
There is always at least one missing element from the sequence.
For example given
std::vector<int> S={9,2,1,10};
The answer should be 3, because the missing integers are 3,4,5,6,7,8,11,... and the minimum is 3.
I have come up with this:
int min_missing( std::vector<int> & S)
{
int max = *std::max_element(S.begin(), S.end());
int min = *std::min_element(S.begin(), S.end());
int i = min;
for(; i!=max and std::find(S.begin(), S.end(), i) != S.end() ; ++i);
return i;
}
This is O(nmlogn) in time, but I cannot figure out if there is a more efficient C++ STL way to do this?
This is not an exercise but I am doing a set of problems for self-improvement , and I have found this to be a very interesting problem. I am interested to see how I can improve this.
You could use std::sort, and then use std::adjacent_find with a custom predicate.
int f(std::vector<int> v)
{
std::sort(v.begin(), v.end());
auto i = std::adjacent_find( v.begin(), v.end(), [](int x, int y)
{
return y != x+1;
} );
if (i != v.end())
{
return *i + 1;
}
}
It is left open what happens when no such element exists, e.g. when the vector is empty.
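One possible way to close those open cases, sketched under the assumption that the search should start at 1 and that duplicates may occur: sort a copy and walk it with a candidate counter. The function name min_missing_positive is hypothetical, and this replaces std::adjacent_find with a plain loop.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical variant: smallest positive integer that does not occur in v.
// Returns 1 for an empty vector. Takes v by value so the caller's data is untouched.
int min_missing_positive(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    int candidate = 1;
    for (int x : v) {
        if (x == candidate) ++candidate;  // candidate is present, try the next one
        else if (x > candidate) break;    // gap found: candidate is missing
        // x < candidate: a duplicate or a non-positive value, skip it
    }
    return candidate;
}
```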
Find the first missing positive in O(n) time and constant space.
Basically, when you read a value a, swap it into position a-1, so that, for example, 2 ends up in A[1].
class Solution {
public:
/**
* @param A: a vector of integers
* @return: an integer
*/
int firstMissingPositive(vector<int> A) {
// write your code here
int n = A.size();
for(int i=0;i<n;)
{
if(A[i]==i+1)
i++;
else
{
if(A[i]>=1&&A[i]<=n&& A[A[i]-1]!=A[i])
swap(A[i],A[A[i]-1]);
else
i++;
}
}
for(int i=0;i<n;i++)
if(A[i]!=i+1)
return i+1;
return n+1;
}
};
Assuming the data are sorted first:
auto missing_data = std::mismatch(S.cbegin(), S.cend()-1, S.cbegin() + 1,
[](int x, int y) { return (x+1) == y;});
EDIT
As your input data are not sorted, the simplest solution is to sort them first:
std::vector<int> data(S.size());
std::partial_sort_copy (S.cbegin(), S.cend(), data.begin(), data.end());
auto missing_data = std::mismatch (data.cbegin(), data.cend()-1, data.cbegin()+1,
[](int x, int y) { return (x+1) == y;});
You can use the <algorithm> header of the C++ standard library in your code.
#include <algorithm> // std::sort
Using std::sort from <algorithm>:
std::vector<int> v={9,2,5,1,3};
std::sort(v.begin(),v.end());
std::cout << v[0];
I hope I understood what you are looking for.
You can do this by building a set of the larger integers seen so far, and keeping the minimum integer not yet seen as a counter. Once an input number equals that counter, advance the counter, removing elements from the set for as long as they match it, until you reach a missing integer.
Please see below for an implementation.
template<typename I> typename I::value_type solver(I b, I e)
{
constexpr typename I::value_type maxseen=
std::numeric_limits<typename I::value_type>::max();
std::set<typename I::value_type> seen{maxseen};
typename I::value_type minnotseen(1);
for(I p=b; p!=e;++p)
{
if(*p == minnotseen)
{
while(++minnotseen == *seen.begin())
{
seen.erase(seen.begin());
}
} else if( *p > minnotseen)
{
seen.insert(*p);
}
}
return minnotseen;
}
In case your sequence is in a vector, you should use this with:
solver(sequence.begin(),sequence.end());
The algorithm is O(N) in time and O(1) in space only if the set of larger values seen so far stays at a constant size; apart from that set, it uses just a counter and a few iterators.
Complexity (order of growth): the iterations cost O(N + N log K), where K is the size of the set of seen larger numbers. If K is expected to stay constant with respect to the input size, this gives O(N) time and O(1) space. In the worst case (for example, when the input arrives in descending order) the set holds nearly all N elements, so the cost is O(N log N) time and O(N) space. (see comments)

Which STL to use to find index by value in O(1) in C++

Say I have an array arr[] = {1 , 3 , 5, 12124, 24354, 12324, 5}
I want to know the index of the value 5(i.e, 2) in O(1).
How should I go about this?
P.S :
1. Throughout my program, I shall be finding only indices and not the vice versa (getting the value by index).
2. The array can have duplicates.
If you can guarantee there are no duplicates in the array, your best bet is probably creating an unordered_map where the map key is the array value and the map value is its index.
I wrote a method below that converts an array to an unordered_map.
#include <unordered_map>
#include <iostream>
template <typename T>
void arrayToMap(const T arr[], size_t arrSize, std::unordered_map<T, int>& map)
{
for(int i = 0; i < arrSize; ++i) {
map[arr[i]] = i;
}
}
int main()
{
int arr[] = { 1 , 3 , 5, 12124, 24354, 12324, 5 };
std::unordered_map<int, int> map;
arrayToMap(arr, sizeof(arr)/sizeof(*arr), map);
std::cout << "Value" << '\t' << "Index" << std::endl;
for(auto it = map.begin(), e = map.end(); it != e; ++it) {
std::cout << it->first << "\t" << it->second << std::endl;
}
}
However, in your example you use the value 5 twice. This causes a strange output in the above code. The outputted map does not have a value with an index 2. Even if you use an array, you would be confronted with a similar problem (i.e. should you use the value at 2 or 6?).
If you really need both values, you could use unordered_multimap, but the syntax for accessing elements isn't as easy as using operator[] (you have to use unordered_multimap::find(), which returns an iterator).
template <typename T>
void arrayToMap(const T arr[], size_t arrSize, std::unordered_multimap<T, int>& map)
{
for(int i = 0; i < arrSize; ++i) {
map.emplace(arr[i], i);
}
}
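As a small sketch of what the lookup could look like with the multimap, equal_range returns the half-open range of all entries stored under one key. The helper name indices_of is made up for illustration, and the result is sorted because the iteration order of equal keys is not specified.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Hypothetical helper: collects every index stored for `value`, in ascending order.
std::vector<int> indices_of(const std::unordered_multimap<int, int>& map, int value)
{
    std::vector<int> result;
    auto range = map.equal_range(value);  // [first, second) holds all entries with this key
    for (auto it = range.first; it != range.second; ++it) {
        result.push_back(it->second);
    }
    std::sort(result.begin(), result.end());  // order within the range is unspecified
    return result;
}
```

For the example array, looking up 5 would give both positions, 2 and 6.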
Finally, you should consider that unordered_map's fast look-up time O(1) comes with some overhead, so it uses more memory than a simple array. But if you end up using an array (which is comparatively much more memory efficient), searching for a specific value is guaranteed to be O(n) where n is the index of the value.
Edit - If you need the duplicate with the lowest index to be kept instead of the highest, you can just reverse the order of insertion:
template <typename T>
void arrayToMap(const T arr[], size_t arrSize, std::unordered_map<T, int>& map)
{
for(int i = static_cast<int>(arrSize) - 1; i >= 0; --i) {
map[arr[i]] = i;
}
}
Use std::unordered_map from C++11 to map elements as keys to indices as values. Then you can answer your query in amortized O(1) time. std::unordered_map works as long as there are no duplicates, but it costs linear extra space.
If your value's range is not too large, you can use an array as well. This will yield even better theta(1) complexity.
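A minimal sketch of the array-based lookup, assuming all values are non-negative and bounded by a known max_value (both assumptions, as is the name build_index_table). For duplicates it keeps the lowest index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: builds a table where table[v] is the first index of
// value v in arr, or -1 if v does not occur.
// Assumes every value in arr lies in [0, max_value].
std::vector<int> build_index_table(const std::vector<int>& arr, int max_value)
{
    std::vector<int> table(static_cast<std::size_t>(max_value) + 1, -1);
    for (std::size_t i = 0; i < arr.size(); ++i) {
        int& slot = table[static_cast<std::size_t>(arr[i])];
        if (slot == -1) slot = static_cast<int>(i);  // keep the lowest index for duplicates
    }
    return table;
}
```

After the O(n) build, each lookup is a single array access, table[value].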
Use unordered_multimap (C++11 and later) with the value as the key, and the position index as the value.

What's the practical difference between std::nth_element and std::sort?

I've been looking at the std::nth_element algorithm which apparently:
Rearranges the elements in the range [first,last), in such a way that
the element at the resulting nth position is the element that would be
in that position in a sorted sequence, with none of the elements
preceding it being greater and none of the elements following it
smaller than it. Neither the elements preceding it nor the elements
following it are guaranteed to be ordered.
However, with my compiler, running the following:
vector<int> myvector;
srand(GetTickCount());
// set some values:
for ( int i = 0; i < 10; i++ )
myvector.push_back(rand());
// nth_element around the 4th element
nth_element (myvector.begin(), myvector.begin()+4, myvector.end());
// print results
for (auto it=myvector.begin(); it!=myvector.end(); ++it)
cout << " " << *it;
cout << endl;
Always returns a completely sorted list of integers in exactly the same way as std::sort does. Am I missing something? What is this algorithm useful for?
EDIT: Ok the following example using a much larger set shows that there is quite a difference:
vector<int> myvector;
srand(GetTickCount());
// set some values:
for ( int i = 0; i < RAND_MAX; i++ )
myvector.push_back(rand());
// nth_element around the 4th element
nth_element (myvector.begin(), myvector.begin()+rand(), myvector.end());
vector<int> copy = myvector;
std::sort(myvector.begin(), myvector.end());
cout << (myvector == copy ? "true" : "false") << endl;
It's perfectly valid for std::nth_element to sort the entire range for fulfilling the documented semantic - however, doing so will fail at meeting the required complexity (linear). The key point is that it may do so, but it doesn't have to.
This means that std::nth_element can bail out early - as soon as it can tell what the n'th element of your range is going to be, it can stop. For instance, for a range
[9,3,6,2,1,7,8,5,4,0]
asking it to give you the fourth element may yield something like
[2,0,1,3,8,5,6,9,7,4]
The list was partially sorted, just good enough to be able to tell that the fourth element in order will be 3.
Hence, if you want to answer 'which number is the fourth-smallest' or 'which are the four smallest' numbers then std::nth_element is your friend.
If you want to get the four smallest numbers in order you may want to consider using std::partial_sort.
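To make the distinction concrete, here is a short sketch with two made-up helpers: one uses std::nth_element to get just the value at a given rank, the other uses std::partial_sort to get the k smallest values in order. Both work on copies.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: the value at sorted rank k (0-based); the rest of the
// copy is only partitioned around it, not ordered.
// Precondition: k < v.size().
int kth_smallest(std::vector<int> v, std::size_t k)
{
    auto nth = v.begin() + static_cast<std::ptrdiff_t>(k);
    std::nth_element(v.begin(), nth, v.end());
    return v[k];
}

// Hypothetical helper: the k smallest values of v, in ascending order.
// Precondition: k <= v.size().
std::vector<int> k_smallest_sorted(std::vector<int> v, std::size_t k)
{
    auto middle = v.begin() + static_cast<std::ptrdiff_t>(k);
    std::partial_sort(v.begin(), middle, v.end());  // only [0, k) ends up sorted
    v.resize(k);
    return v;
}
```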
The implementation of std::nth_element looks as follows:
void _Nth_element(_RanIt _First, _RanIt _Nth, _RanIt _Last, _Pr _Pred)
{
for (; _ISORT_MAX < _Last - _First; )
{ // divide and conquer, ordering partition containing Nth
pair<_RanIt, _RanIt> _Mid =
_Unguarded_partition(_First, _Last, _Pred);
if (_Mid.second <= _Nth)
_First = _Mid.second;
else if (_Mid.first <= _Nth)
return; // Nth inside fat pivot, done
else
_Last = _Mid.first;
}
_Insertion_sort(_First, _Last, _Pred); // sort any remainder
}
where _ISORT_MAX is defined as 32.
So if your sequence has 32 elements or fewer, it just performs an insertion sort on it.
Therefore your short sequence is fully sorted.
std::sort sorts all the elements. std::nth_element doesn't. It just puts the nth element in the nth position, with smaller or equal elements on one side and larger or equal elements on the other. It is used if you want to find the nth element (obviously) or if you want the n smallest or largest elements. A full sort satisfies these requirements.
So why not just perform a full sort and get the nth element? Because std::nth_element is required to have O(N) complexity on average, whereas std::sort is O(Nlog(N)). std::sort cannot satisfy the complexity requirement of std::nth_element.
If you do not need complete sorting of the range, it is advantageous to use it.
As for your example, when I run similar code on GCC 4.7, I get the expected results:
for ( int i = 0; i < 10; i++ )
myvector.push_back(rand()%32); // make the numbers small
cout << myvector << "\n";
// nth_element around the 4th element
nth_element (myvector.begin(), myvector.begin()+4, myvector.end());
cout << myvector << "\n";
std::sort(myvector.begin(), myvector.end());
cout << myvector << "\n";
produces
{ 7, 6, 9, 19, 17, 31, 10, 12, 9, 13 }
{ 9, 6, 9, 7, 10, 12, 13, 31, 17, 19 }
{ 6, 7, 9, 9, 10, 12, 13, 17, 19, 31 }
^
where I've used a custom made ostream operator<< to print out the results.
I have compared execution times of std::sort vs. std::nth_element when running on large vectors (512M elements) of random unsigned long long values and taking the middle element. Yes, I know it is O(N log(N)) vs O(N); still, I somehow expected std::nth_element(mid) to be about twice as fast as std::sort, as it only needs to "sort" about half of the elements. The results surprised me a bit, which is why I'm sharing them:
timeSort = 217407 (msec)
timeNthElement = 18218 (msec)
std::sort was about 12 times slower
Here is the piece of code I used (it is using windows.h) :
#include <windows.h>
#include <string>
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <random>
int main()
{
static const size_t NUMELEM = 512 * 1024 * 1024;
static const size_t NUMITER = 3;
std::vector<unsigned long long> vec1(NUMELEM);
std::vector<unsigned long long> vec2(NUMELEM);
std::random_device rd;
std::mt19937 rand(rd());
std::uniform_int_distribution<unsigned long long> dist(0, NUMELEM * 2);
unsigned long long timeNthElement = 0;
unsigned long long timeSort = 0;
for (size_t j = 0; j < NUMITER; ++j)
{
for (size_t i = 0; i < NUMELEM; ++i)
{
unsigned long long val = dist(rand);
vec1[i] = val;
vec2[i] = val;
}
ULONGLONG t1 = GetTickCount64();
std::sort(begin(vec1), end(vec1));
ULONGLONG t2 = GetTickCount64();
std::nth_element(begin(vec2), begin(vec2)+NUMELEM/2, end(vec2));
ULONGLONG t3 = GetTickCount64();
if (vec1[NUMELEM / 2] != vec2[NUMELEM / 2])
{
Sleep(0); // I put a breakpoint here but of course never caught it...
}
timeSort += t2 - t1;
timeNthElement += t3 - t2;
}
std::cout << "timeSort = " << timeSort << std::endl;
std::cout << "timeNthElement = " << timeNthElement << std::endl;
return 0;
}
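For readers not on Windows, a portable variant of the same measurement might look like the sketch below. It swaps GetTickCount64 for std::chrono::steady_clock and fills the vector with raw std::mt19937_64 output instead of the distribution used above; the Timing struct and the function name compare_sort_and_nth are hypothetical, and the timings it reports will of course differ from the figures quoted earlier.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

struct Timing {          // hypothetical result type
    long long sort_ms;   // time spent in std::sort
    long long nth_ms;    // time spent in std::nth_element
    bool same_median;    // both approaches agree on the middle element
};

// Precondition: count > 0.
Timing compare_sort_and_nth(std::size_t count, unsigned seed)
{
    std::mt19937_64 rng(seed);
    std::vector<unsigned long long> a(count);
    for (auto& x : a) x = rng();
    std::vector<unsigned long long> b = a;  // identical input for both algorithms

    using clock = std::chrono::steady_clock;
    using ms = std::chrono::milliseconds;
    const auto mid = static_cast<std::ptrdiff_t>(count / 2);

    const auto t1 = clock::now();
    std::sort(a.begin(), a.end());
    const auto t2 = clock::now();
    std::nth_element(b.begin(), b.begin() + mid, b.end());
    const auto t3 = clock::now();

    return { static_cast<long long>(std::chrono::duration_cast<ms>(t2 - t1).count()),
             static_cast<long long>(std::chrono::duration_cast<ms>(t3 - t2).count()),
             a[count / 2] == b[count / 2] };
}
```

Calling it with a few different sizes shows how the gap between the two grows with the input.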