Entropy-Optimal Quicksort - C++

I was doing an exercise for sorting in my Algorithms class where we are required to implement various sorting algorithms and test them against the inputs provided by our professor.
I have the following implementation of quicksort, which is entropy-optimal, meaning it may be faster than the N log N bound when a large number of the elements are equal. My implementation can be found below this post (I removed the pastebin link as suggested in the comments).
On running it I found that it is slower than the std::sort algorithm (I do understand that this is just a difference in the constant factor on the N log N bound), but as a result I miss the time limits for large input sequences.
Also, when the input size is 1000000, std::sort is able to sort, but my algorithm gives me a segmentation fault. Can someone please take a look at this and let me know if I am doing something wrong? Thanks in advance.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>
#include <utility>
struct Sort {
public:
enum class SortAlg { selection = 0, insertion, shell, merge, mergeBU, quick, heap };
template <typename T, int N>
static void sort(T (&arr) [N], SortAlg alg) {
SortArray<T,N> sortM (arr);
switch (alg) {
case SortAlg::quick:
sortM.quicksort(); break;
default:
sortM.quicksort();
};
}
private:
template <typename T, int N>
class SortArray {
public:
SortArray(T (&a) [N]) : arr(a) {}
void quicksort();
private:
void qsort(int lo, int hi);
std::pair<int, int> partition(int lo, int hi);
T (&arr) [N];
};
};
template <typename T, int N>
void Sort::SortArray<T, N>::quicksort(){
qsort(0, N-1);
}
template <typename T, int N>
void Sort::SortArray<T, N>::qsort(int lo, int hi){
if (lo >= hi) return;
std::pair<int, int> part = partition(lo, hi);
qsort(lo, part.first);
qsort (part.second, hi);
}
//This partitions the array into 3 ranges
//1st range - elements less than the partition element
//2nd range - elements equal to the partition element
//3rd range - elements greater than the partition element
//it returns a pair (a,b) where [a+1, b-1] represents the
//equal range which will be left out of subsequent sorts and
//the next set of sorting will be on [lo,a] and [b,hi]
template <typename T, int N>
std::pair<int, int> Sort::SortArray<T, N>::partition(int lo, int hi){
static int count = 0;
std::random_device rd;
std::mt19937_64 gen(rd());
std::uniform_int_distribution<int> dis;
int elem = lo + (dis(gen) % (hi-lo+1)); //position of the element around which partitioning happens
using std::swap;
swap(arr[lo], arr[elem]);
T val = arr[lo]; //pivot value (use the element type T, not int)
//after the while loop below completes
//the range of elements [lo, eqind1-1] and [eqind2+1, hi] will all be equal to arr[lo]
//the range of elements [eqind1, gt] will all be less than arr[lo]
//the range of elements [lt, eqind2] will all be greater than arr[lo]
int lt = lo+1, gt = hi, eqind1 = lo, eqind2 = hi;
while (true){
while(lt <= gt && arr[lt] <= val) {
if (arr[lt] == val){
if(lt == eqind1 + 1)
++eqind1;
else
swap(arr[lt], arr[++eqind1]);
}
++lt;
}
while(gt >= lt && arr[gt] >= val) {
if(arr[gt] == val){
if(gt == eqind2)
--eqind2;
else
swap(arr[gt], arr[eqind2--]);
}
--gt;
};
if(lt >= gt) break;
swap(arr[lt], arr[gt]); ++lt; --gt;
};
swap(arr[lo], arr[gt]);
if (eqind1!=lo){
//there are some elements equal to arr[lo] in the first eqind1-1 places
//move the elements which are less than arr[lo] to the beginning
for (int i = 1; i<lt-eqind1; i++)
arr[lo+i] = arr[lo + eqind1+i];
}
if (eqind2!=hi){
//there are some elements which are equal to arr[lo] in the last eqind2-1 places
//move the elements which are greater than arr[lo] towards the end of the array
for(int i = hi; i>gt; i--)
arr[i] = arr[i-hi+eqind2];
}
//calculate the number of elements equal to arr[lo] and fill them up in between
//the elements less than and greater than arr[lo]
int numequals = eqind1 - lo + hi - eqind2 + 1;
if(numequals != 1){
for(int i = 0; i < numequals; i++)
arr[lo+lt-eqind1+i-1] = val;
}
//calculate the range of elements that are less than and greater than arr[lo]
//and return them back to qsort
int lorange = lo + lt-eqind1-2;
int hirange = lo + lt - eqind1 - 1 + numequals;
return {lorange, hirange};
}
int main() {
std::random_device rd;
std::mt19937_64 gen(rd());
std::uniform_int_distribution<int> dis;
constexpr int size = 100000;
int arr[size], arr1[size];
for (int i = 0; i < size; i++){
arr[i] = dis(gen)%9;
arr1[i] = arr[i];
}
std::sort(std::begin(arr1), std::end(arr1));
std::cout << "Standard sort finished" << std::endl;
Sort::sort(arr, Sort::SortAlg::quick);
std::cout << "Custom sort finished" << std::endl;
int countDiffer = 0;
for (int i = 0; i < size; ++i){
if (arr[i] != arr1[i]){
countDiffer++;
}
}
if (countDiffer == 0) std::cout << "Sorted" << std::endl;
else std::cout << "Not sorted and differ in " << countDiffer
<< " places" << std::endl;
}

There are two problems with the code.
A) You are creating a stack-allocated array that might be large. Once the stack overflows, the next page might be anything from unmapped memory to random heap memory.
B) The other odd thing I noticed is that you initialize an RNG on every call to partition (which can also be expensive), and this wastes stack space at every partition point.

You have two different problems, which should really warrant two different questions. I will, however, answer one for you.
Oh, and in the future please don't link out to code; what if that link goes dead? Then your question will be useless.
The problem with the crash is that just about all compilers place local variables (including arrays) on the stack, and the stack is limited. On Windows, for example, the default stack for a process is only a single megabyte.
With two such arrays of 1000000 entries each, you will have eight megabytes (which happens to be the default stack size for Linux processes), plus of course space for the function call stack frames and all the other local variables, arguments, and so on. This is beyond (or way beyond) the available stack, and you will have undefined behavior and a probable crash.
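As a rough sketch of the fix (assuming the rest of the program stays as in the question), main could put the data on the heap instead; note that the Sort::sort interface above takes a T (&)[N], so it would need an overload taking a pointer and length, or iterators, to accept a vector:
#include <vector>

int main()
{
    constexpr int size = 1000000;
    // std::vector stores its elements on the free store, so these buffers no
    // longer depend on the (small) per-thread stack size.
    std::vector<int> arr(size), arr1(size);
    // ... fill both, run std::sort on arr1 and the custom sort on arr,
    // then compare, as in the original main ...
}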

Microsoft's std::sort uses introsort. Wiki link:
http://en.wikipedia.org/wiki/Introsort
Introsort switches from quicksort to heapsort if the recursion depth reaches some limit. This is primarily a performance safeguard, since excessive nesting is an indicator that quicksort is heading for its worst case, but it also has the side benefit of preventing deep recursion from running a thread out of stack space.
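Very roughly, the idea looks like this (a sketch of the general technique with my own names, not Microsoft's actual implementation):
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Quicksort that tracks its recursion depth and falls back to heapsort
// (make_heap + sort_heap) once the depth exceeds about 2*log2(n).
void introsortImpl(std::vector<int>& a, int lo, int hi, int depthLimit)
{
    if (hi - lo <= 0) return;
    if (depthLimit == 0) {
        std::make_heap(a.begin() + lo, a.begin() + hi + 1);
        std::sort_heap(a.begin() + lo, a.begin() + hi + 1);
        return;
    }
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {                     // simple Hoare-style partition
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    introsortImpl(a, lo, j, depthLimit - 1);
    introsortImpl(a, i, hi, depthLimit - 1);
}

void introsort(std::vector<int>& a)
{
    int n = static_cast<int>(a.size());
    if (n > 1) introsortImpl(a, 0, n - 1, 2 * static_cast<int>(std::log2(n)));
}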

Related

Why does this quicksort appear to be faster than std::sort?

Why does it appear that this quicksort algorithm is faster than std::sort? I've checked to make sure it's actually sorting the array. I've also replaced both sorting calls with hollow for loops of the same number of iterations to test the timing benchmark and everything checked out there.
I'm also wondering what adjustments I can make to the quicksort to allow it to recurse more times. Maybe some sort of variable memory management?
#include <iostream>
#include <vector>
#include <algorithm>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
using namespace std;
void quickSort(int*, int);
void fillRandom(int*, int,int b2);
int main() {
//setup arrays
int size = 100000;
auto myints = new int[size];
auto myints2 = new int[size];
fillRandom(myints, size,10000);
std::copy(myints, myints + size, myints2);
//measurement 1
auto t1 = std::chrono::high_resolution_clock::now();
quickSort(myints, size);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << endl << "Execution 1 took: "<< duration << endl;
//measurement 2
t1 = std::chrono::high_resolution_clock::now();
std::sort(myints2,myints2+size);
t2 = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << endl << "Execution 2 took: " << duration << endl;
cout << "finished!";
return 1;
}
void fillRandom(int* p, int size,int upTo) {
srand(time(0));
for (int i = 0;i < size;i++) {
p[i] = rand() % upTo + 1;
}
}
void quickSortSwap(int *p1, int*p2) {
int temp = *p1;
*p1 = *p2;
*p2 = temp;
}
void quickSort(int* original, int len) {
int split = *original;
int greaterIndex = len - 1;
int lesserIndex = 1;
int* currentP;
//rearrange stuff so smaller is left, bigger is right
for (int i = 1;i < len;i++) {
currentP = original + lesserIndex;
//cout << *currentP << " compared to " << split << endl;
if (*currentP <= split) {
lesserIndex++;
}
else {
//cout << "greater: " << *currentP <<endl;
quickSortSwap(currentP, original + greaterIndex);
greaterIndex--;
}
}
//uhh, now we switch pivot element with the right most left side element. Adjust our left side length measurement accordingly.
lesserIndex--;
quickSortSwap(original, original + lesserIndex);
greaterIndex++;
//this point
if (lesserIndex > 1) {
quickSort(original, lesserIndex);
}
int greater_range = len - greaterIndex;
if (greater_range > 1) {
quickSort(original + greaterIndex, greater_range);
}
}
https://rextester.com/AOPBP48224
Visual Studio's std::sort has some overhead and some optimizations that your program does not. Your program is based on the Lomuto partition scheme, while std::sort is a single-pivot, 3-partition Hoare-like quicksort plus insertion sort for small partitions. The 3 partitions are elements < pivot, elements == pivot, and elements > pivot. If there are no duplicate values, the 3-partition sort is just some overhead. If there are duplicate values, then as the number of duplicates increases, Lomuto gets worse, while Hoare or std::sort gets better. Try using fillRandom(myints, size, 10); and you should see a large performance hit with the Lomuto method, and an increase in performance with std::sort().
Visual Studio's std::sort uses median-of-nine if there are >= 40 elements, and median-of-three for 33 to 39 elements, which reduces the probability of worst-case scenarios; it switches to insertion sort for <= 32 elements (this speeds it up). To reduce stack overhead, it recurses into the smaller partition and loops back to handle the larger partition. It has a check to avoid worst-case O(n^2) time complexity, switching to heap sort if the depth of partitioning ("recursions") becomes excessive. It uses iterators instead of plain pointers, but in release mode my impression is that the iterators are unchecked and effectively pointers. It also uses a function pointer for the less-than compare, defaulting to std::less, which I don't know whether it gets optimized away.
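For illustration, a 3-way partition quicksort (Dutch national flag style) in the spirit of what this answer describes might look roughly like this; it is a sketch with my own names, not the code std::sort actually uses:
#include <utility>

// After the partition loop: [lo, lt) < pivot, [lt, gt] == pivot, (gt, hi] > pivot.
// The middle band of equal elements is excluded from both recursive calls, so
// heavily duplicated inputs get cheaper rather than more expensive.
void quickSort3(int* a, int lo, int hi)
{
    if (lo >= hi) return;
    int pivot = a[lo + (hi - lo) / 2];
    int lt = lo, i = lo, gt = hi;
    while (i <= gt) {
        if (a[i] < pivot)      std::swap(a[lt++], a[i++]);
        else if (a[i] > pivot) std::swap(a[i], a[gt--]);
        else                   ++i;
    }
    quickSort3(a, lo, lt - 1);
    quickSort3(a, gt + 1, hi);
}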

Efficient way of finding if a container contains duplicated values with STL? [duplicate]

I wrote this code in C++ as part of a uni task where I need to ensure that there are no duplicates within an array:
// Check for duplicate numbers in user inputted data
int i; // Need to declare i here so that it can be accessed by the 'inner' loop that starts on line 21
for(i = 0;i < 6; i++) { // Check each other number in the array
for(int j = i; j < 6; j++) { // Check the rest of the numbers
if(j != i) { // Makes sure don't check number against itself
if(userNumbers[i] == userNumbers[j]) {
b = true;
}
}
if(b == true) { // If there is a duplicate, change that particular number
cout << "Please re-enter number " << i + 1 << ". Duplicate numbers are not allowed:" << endl;
cin >> userNumbers[i];
}
} // Comparison loop
b = false; // Reset the boolean after each number entered has been checked
} // Main check loop
It works perfectly, but I'd like to know if there is a more elegant or efficient way to check.
You could sort the array in O(n log(n)), then simply compare each element with the next one. That is substantially faster than your existing O(n^2) algorithm. The code is also a lot cleaner. Your code also doesn't ensure that no duplicates were inserted when they were re-entered; you need to prevent duplicates from existing in the first place.
std::sort(userNumbers.begin(), userNumbers.end());
for(int i = 0; i < userNumbers.size() - 1; i++) {
if (userNumbers[i] == userNumbers[i + 1]) {
userNumbers.erase(userNumbers.begin() + i);
i--;
}
}
I also second the recommendation to use a std::set - no duplicates there.
The following solution is based on sorting the numbers and then removing the duplicates:
#include <algorithm>
int main()
{
int userNumbers[6];
// ...
int* end = userNumbers + 6;
std::sort(userNumbers, end);
bool containsDuplicates = (std::unique(userNumbers, end) != end);
}
Indeed, the fastest and as far I can see most elegant method is as advised above:
std::vector<int> tUserNumbers;
// ...
std::set<int> tSet(tUserNumbers.begin(), tUserNumbers.end());
std::vector<int>(tSet.begin(), tSet.end()).swap(tUserNumbers);
It is O(n log n). This, however, does not work if the ordering of the numbers in the input array needs to be kept. In that case I did:
std::set<int> tTmp;
std::vector<int>::iterator tNewEnd =
std::remove_if(tUserNumbers.begin(), tUserNumbers.end(),
[&tTmp] (int pNumber) -> bool {
return (!tTmp.insert(pNumber).second);
});
tUserNumbers.erase(tNewEnd, tUserNumbers.end());
which is still O(n log n) and keeps the original ordering of elements in tUserNumbers.
Cheers,
Paul
This is an extension of the answer by @Puppy, which is the current best answer.
PS: I tried to post this as a comment on @Puppy's answer but couldn't, as I don't have 50 reputation yet. Also, a bit of experimental data is shared here for further help.
Both std::set and std::map are implemented in the STL using balanced binary search trees, so both lead to a complexity of O(n log n) in this case. Better performance can be achieved if a hash table is used: std::unordered_map offers a hash-table-based implementation with faster lookups. I experimented with all three implementations and found the results using std::unordered_map to be better than std::set and std::map. Results and code are shared below. The images are snapshots of the performance measured by LeetCode on the solutions.
bool hasDuplicate(vector<int>& nums) {
size_t count = nums.size();
if (!count)
return false;
std::unordered_map<int, int> tbl;
//std::set<int> tbl;
for (size_t i = 0; i < count; i++) {
if (tbl.find(nums[i]) != tbl.end())
return true;
tbl[nums[i]] = 1;
//tbl.insert(nums[i]);
}
return false;
}
unordered_map Performance (Run time was 52 ms here)
Set/Map Performance
You can add all elements to a set and check while adding whether each one is already present. That would be more elegant and efficient.
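A minimal sketch of that idea (assuming ints, as in the question):
#include <set>
#include <vector>

bool hasDuplicates(const std::vector<int>& nums)
{
    std::set<int> seen;
    for (int x : nums)
        if (!seen.insert(x).second)   // insert() reports whether the value was actually new
            return true;              // it wasn't, so this is a duplicate
    return false;
}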
I'm not sure why this hasn't been suggested, but here is a way in base 10 to find duplicate digits in O(n). The problem I see with the already-suggested O(n) solution is that it requires the digits to be sorted first. This method is O(n) and does not require the set to be sorted. The cool thing is that checking whether a specific digit has duplicates is O(1). I know this thread is probably dead, but maybe it will help somebody! :)
/*
============================
Foo
============================
*
Takes in a read only unsigned int. A table is created to store counters
for each digit. If any digit's counter is flipped higher than 1, function
returns. For example, with 48778584:
0 1 2 3 4 5 6 7 8 9
[0] [0] [0] [0] [2] [1] [0] [2] [2] [0]
When we iterate over this array, we find that 4 is duplicated and immediately
return false.
*/
bool Foo(int number)
{
int temp = number;
int digitTable[10]={0};
while(temp > 0)
{
digitTable[temp % 10]++; // Last digit's respective index.
temp /= 10; // Move to next digit
}
for (int i=0; i < 10; i++)
{
if (digitTable [i] > 1)
{
return false;
}
}
return true;
}
It's OK, especially for small array lengths. I'd use more efficient approaches (fewer than n^2/2 comparisons) if the array is much bigger - see DeadMG's answer.
Some small corrections for your code:
Instead of int j = i, write int j = i + 1, and you can omit your if(j != i) test (see the sketch below).
You shouldn't need to declare the i variable outside the for statement.
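With those two corrections applied, the duplicate check might look roughly like this (keeping the question's O(n^2) approach and its fixed array size of 6; the re-prompting logic is left out for brevity):
// Returns true if any two of the six entries are equal.
bool hasDuplicate(const int userNumbers[6])
{
    for (int i = 0; i < 6; i++) {          // i declared inside the for statement
        for (int j = i + 1; j < 6; j++) {  // start at i + 1, so no j != i test is needed
            if (userNumbers[i] == userNumbers[j])
                return true;
        }
    }
    return false;
}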
I think @Michael Jaison G's solution is really brilliant; I modify his code a little to avoid sorting. (By using unordered_set, the algorithm may be a little faster.)
template <class Iterator>
bool isDuplicated(Iterator begin, Iterator end) {
using T = typename std::iterator_traits<Iterator>::value_type;
std::unordered_set<T> values(begin, end);
std::size_t size = std::distance(begin,end);
return size != values.size();
}
//std::unique(_copy) requires a sorted container.
std::sort(cont.begin(), cont.end());
//testing if cont has duplicates
std::unique(cont.begin(), cont.end()) != cont.end();
//getting a new container with no duplicates
std::unique_copy(cont.begin(), cont.end(), std::back_inserter(cont2));
#include<iostream>
#include<algorithm>
int main(){
int arr[] = {3, 2, 3, 4, 1, 5, 5, 5};
int len = sizeof(arr) / sizeof(*arr); // Finding length of array
std::sort(arr, arr+len);
int unique_elements = std::unique(arr, arr+len) - arr;
if(unique_elements == len) std::cout << "Duplicate number is not present here\n";
else std::cout << "Duplicate number present in this array\n";
return 0;
}
As mentioned by #underscore_d, an elegant and efficient solution would be,
#include <algorithm>
#include <vector>
template <class Iterator>
bool has_duplicates(Iterator begin, Iterator end) {
using T = typename std::iterator_traits<Iterator>::value_type;
std::vector<T> values(begin, end);
std::sort(values.begin(), values.end());
return (std::adjacent_find(values.begin(), values.end()) != values.end());
}
int main() {
int user_ids[6];
// ...
std::cout << has_duplicates(user_ids, user_ids + 6) << std::endl;
}
Fast O(N) time and space solution; it returns as soon as it hits a duplicate:
#include <algorithm>
#include <unordered_set>
#include <vector>

template <typename T>
bool containsDuplicate(std::vector<T>& items) {
    // any_of stops at the first element whose insertion into the set fails, i.e. the first duplicate
    return std::any_of(items.begin(), items.end(),
                       [s = std::unordered_set<T>{}](const auto& item) mutable {
                           return !s.insert(item).second;
                       });
}
Not enough reputation to post a comment, hence a post.
vector <int> numArray = { 1,2,1,4,5 };
unordered_map<int, bool> hasDuplicate;
bool flag = false;
for (auto i : numArray)
{
if (hasDuplicate[i])
{
flag = true;
break;
}
else
hasDuplicate[i] = true;
}
cout << (flag ? "Duplicate" : "No duplicate");

My QuickSort code does not work for 1000000+ elements (one million elements or more)

I attempted to make my own sorting algorithm (call it MySort for now) and benchmark it against the sorting times of QuickSort. I use a random number generator to make an input file containing n random numbers, then provide this file as input to both MySort and QuickSort, and use std::chrono to time the time they take individually.
(At first I used an online compiler to check the times, but when I hit the limit of 10000 characters as input, I switched to doing it myself on my PC.)
So, for the first few tries (100 elements, 1000 elements, 10000 elements, 100000 elements), everything works fine. I get a proper output time for how long each sorting algorithm takes, but when I try to use 1000000 elements, QuickSort just doesn't give any output (it does not seem to run at all), which is strange, because MySort worked just fine. I don't think it is a space issue, since MySort uses 2n additional space and still works.
The implementation of QuickSort I am using is given below:
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
void quick_sort(int[],int,int);
int partition(int[],int,int);
int main()
{
int n,i;
cin>>n;
int a[n];
for(i=0;i<n;i++)
cin>>a[i];
auto start = high_resolution_clock::now();
quick_sort(a,0,n-1);
auto stop = high_resolution_clock::now();
duration <double, micro> d = stop - start;
cout<<"Time taken = "<<d.count()<<endl;
/*
cout<<"\nArray after sorting:";
for(i=0;i<n;i++)
cout<<a[i]<<endl;
*/
return 0;
}
void quick_sort(int a[],int l,int u)
{
int j;
if(l<u)
{
j=partition(a,l,u);
quick_sort(a,l,j-1);
quick_sort(a,j+1,u);
}
}
int partition(int a[],int l,int u)
{
int v,i,j,temp;
v=a[l];
i=l;
j=u+1;
do
{
do
i++;
while(a[i]<v&&i<=u);
do
j--;
while(v<a[j]);
if(i<j)
{
temp=a[i];
a[i]=a[j];
a[j]=temp;
}
}while(i<j);
a[l]=a[j];
a[j]=v;
return(j);
}
I tried looking around for solutions as to why it refuses to work for a million elements, but found nothing, besides the possibility that it may be a space issue, which seems unlikely to me considering MySort is working.
As for what exactly I get as output on feeding 1000000 elements in, when I execute both files on the command line, the output I get is (both run twice):
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 512129
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 516131
C:\Users\Zac\Desktop>QuickSortTest <output.txt
C:\Users\Zac\Desktop>QuickSortTest <output.txt
C:\Users\Zac\Desktop>
However, if I run them both for only 100000 elements each, this is what I get:
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 76897.1
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 74019.4
C:\Users\Zac\Desktop>QuickSortTest <output.txt
Time taken = 16880.2
C:\Users\Zac\Desktop>QuickSortTest <output.txt
Time taken = 18005.3
C:\Users\Zac\Desktop>
Seems to be working fine.
I am at my wits' end; any suggestions would be wonderful.
cin>>n;
int a[n];
This is your bug. You should never do this for three reasons.
This is not valid C++. In C++, the dimension of an array must be a constant expression. You are being fooled by a non-conforming extension of gcc; your code will fail to compile with other compilers. You should always use gcc (and clang) in high-conformance mode. For C++, that would be g++ -std=c++17 -Wall -pedantic-errors.
A large array local to a function is likely to provoke a stack overflow, since local variables are normally allocated on the stack and stack memory is usually very limited.
C-style arrays are bad, mkay? They don't know their own size, they cannot easily be checked for out-of-bounds access (std::vector and std::array have the at() bounds-checking member function), and they cannot be assigned, passed to functions by value, or returned from functions. Use std::vector instead (or maybe std::array when the size is known in advance).
Let's remove the VLAs you're using and use std::vector. Here is what the code looks like with sample data of 10 items (but with a check for boundary conditions).
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
using namespace std::chrono;
using vint = std::vector<int>;
void quick_sort(vint&, int, int);
int partition(vint&, int, int);
int main()
{
int n = 10, i;
vint a = { 7, 43, 2, 1, 6, 34, 987, 23, 0, 6 };
auto start = high_resolution_clock::now();
quick_sort(a, 0, n - 1);
auto stop = high_resolution_clock::now();
duration <double, micro> d = stop - start;
cout << "Time taken = " << d.count() << endl;
return 0;
}
void quick_sort(vint& a, int l, int u)
{
int j;
if (l < u)
{
j = partition(a, l, u);
quick_sort(a, l, j - 1);
quick_sort(a, j + 1, u);
}
}
int partition(vint& a, int l, int u)
{
int v, i, j, temp;
v = a[l];
i = l;
j = u + 1;
do
{
do
i++;
while (a.at(i) < v&&i <= u);
do
j--;
while (v < a[j]);
if (i < j)
{
temp = a[i];
a[i] = a[j];
a[j] = temp;
}
} while (i < j);
a[l] = a[j];
a[j] = v;
return(j);
}
Live Example.
You see that a std::out_of_range error is thrown on the line with the std::vector.at() call.
Bottom line -- your code was flawed to begin with -- whether it was 10, 100, or a million items. You are going out of bounds, thus the behavior is undefined. Usage of std::vector and at() detected the error, something that VLA's will not give you.
Besides the VLA issue, your quicksort always chooses the first element as the pivot. This may lead to poor performance in worst cases. I don't know your output.txt, but if the array is already sorted, it runs in O(n^2) time because every partitioning step splits off one element and the rest (half and half is the best case). I think this is why it does not give any output for big inputs.
So I would suggest a couple of pivot-choosing heuristics that are commonly used.
Choose it randomly
Choose the median of 3 elements - lowest/middle/highest index (a[l] / a[(l+u)/2] / a[u])
Once you choose a pivot, you can simply swap it with a[l], which minimizes your code changes. A median-of-three sketch is shown below.
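For illustration, a median-of-three helper might look roughly like this (a sketch; the function name is mine, not part of the question's code):
#include <utility>

// Reorders a[l], a[m], a[u] so that a[l] <= a[m] <= a[u] and returns m,
// the index now holding the median of the three.
int medianOfThree(int a[], int l, int u)
{
    int m = l + (u - l) / 2;
    if (a[m] < a[l]) std::swap(a[m], a[l]);
    if (a[u] < a[l]) std::swap(a[u], a[l]);
    if (a[u] < a[m]) std::swap(a[u], a[m]);
    return m;
}
In partition() one could then call int m = medianOfThree(a, l, u); and swap a[l] with a[m] before reading v = a[l];, leaving the rest of the routine unchanged.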

Why is linear search so much faster than binary search?

Consider the following code to find a peak in an array.
#include<iostream>
#include<chrono>
#include<unistd.h>
using namespace std;
//Linear search solution
int peak(int *A, int len)
{
if(A[0] >= A[1])
return 0;
if(A[len-1] >= A[len-2])
return len-1;
for(int i=1; i < len-1; i=i+1) {
if(A[i] >= A[i-1] && A[i] >= A[i+1])
return i;
}
return -1;
}
int mean(int l, int r) {
return l-1 + (r-l)/2;
}
//Recursive binary search solution
int peak_rec(int *A, int l, int r)
{
// cout << "Called with: " << l << ", " << r << endl;
if(r == l)
return l;
if(r == l+ 1)
return (A[l] >= A[l+1])?l:l+1;
int m = mean(l, r);
if(A[m] >= A[m-1] && A[m] >= A[m+1])
return m;
if(A[m-1] >= A[m])
return peak_rec(A, l, m-1);
else
return peak_rec(A, m+1, r);
}
int main(int argc, char * argv[]) {
int size = 100000000;
int *A = new int[size];
for(int l=0; l < size; l++)
A[l] = l;
chrono::steady_clock::time_point start = chrono::steady_clock::now();
int p = -1;
for(int k=0; k <= size; k ++)
// p = peak(A, size);
p = peak_rec(A, 0, size-1);
chrono::steady_clock::time_point end = chrono::steady_clock::now();
chrono::duration<double> time_span = chrono::duration_cast<chrono::duration<double>>(end - start);
cout << "Peak finding: " << p << ", time in secs: " << time_span.count() << endl;
delete[] A;
return 0;
}
If I compile with -O3 and use the linear search solution (the peak function) it takes:
0.049 seconds
If I use the binary search solution which should be much faster (the peak_rec function), it takes:
5.27 seconds
I tried turning off optimization but this didn't change the situation. I also tried both gcc and clang.
What is going on?
What is going on is that you've tested it on one case: a strictly monotonically increasing function. Your linear search routine has a shortcut that checks the final two entries, so it never even does a linear search. You should test random arrays to get a true sense of the distribution of runtimes.
That happens because your linear search solution has an optimization for sorted arrays like the one you are passing into it. The check if(A[len-1] >= A[len-2]) makes your function return before it even enters the search loop when the array is sorted in increasing order, so the complexity there is constant for such arrays. Your binary search, however, actually descends through the array and thus takes much longer. The solution would be to fill your array randomly. You can achieve this by using a random number generator:
int main() {
std::random_device rd; /* Create a random device to seed our twisted mersenne generator */
std::mt19937 gen(rd()); /* create a generator with a random seed */
std::uniform_int_distribution<> range(0, 100000000); /* specify a range for the random values (choose whatever you want)*/
int size = 100000000;
int *A = new int[size];
for(int l=0; l < size; l++)
A[l] = range(gen); /* fill the array with random values in the range 0 - 100000000 */
[ . . . ]
EDIT:
One very important thing when you fill your array randomly: your function will not work with unsorted arrays, since if the first element is greater than the second, or the last one is greater than the previous one, the function returns even if there is a much greater value somewhere in between. So remove those lines if you expect unsorted arrays (which you should, since searching for a peak element in a sorted array is always constant complexity and there is no point in searching for one).

Optimized way to find M largest elements in an NxN array using C++

I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
Right now I'm doing this:
struct SourcePoint {
Point point;
float value;
};
SourcePoint* maxValues = new SourcePoint[ M ];
maxCoefficients = new SourcePoint*[
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
if (sample > maxValues[0].value) {
int q = 1;
while ( sample > maxValues[q].value && q < M ) {
maxValues[q-1] = maxValues[q]; // shuffle the values back
q++;
}
maxValues[q-1].value = sample;
maxValues[q-1].point = Point(i,j);
}
}
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encountered so far. This gives us a quick and easy bailout: if sample <= maxValues[0].value, we don't do anything. The issue I'm having is the shuffling every time a new better value is found. It works its way all the way down maxValues until it finds its spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm since this is called many many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very little bit (not as much as I'd thought there would be).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Look at how the data is created/stored/organized may give you patterns or information to reduce the amount of time to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
int x;
int y;
float value;
int test; //Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, a simple first step would be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to a power-of-two size, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
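To make the second point concrete, here is a sketch of the "replace the minimum, then rescan" alternative (Point flattened to two ints, and the function name chosen just for illustration):
struct SourcePoint { int x; int y; float value; };

// Offer a new sample to the top-M buffer. minIdx tracks which slot currently
// holds the smallest of the kept values, so rejection stays O(1) and a
// successful insertion costs one overwrite plus an O(M) rescan, with no shifting.
void offer(SourcePoint maxValues[], int M, int& minIdx, const SourcePoint& candidate)
{
    if (candidate.value <= maxValues[minIdx].value)
        return;                                 // quick rejection, as in the original code
    maxValues[minIdx] = candidate;              // overwrite the current minimum
    for (int q = 0; q < M; ++q)                 // rescan for the new minimum
        if (maxValues[q].value < maxValues[minIdx].value)
            minIdx = q;
}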
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think the OP's insertion sort probably has decent real-world performance (at least when the recommendations of the other posters in this thread are taken into account).
Look-up in the case of failure should be constant time: if the current element is less than the minimum element of the heap (which contains the max M elements), we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
T nodes[size];
static const unsigned last = size - 1;
static unsigned parent(unsigned i) { return (i - 1) / 2; }
static unsigned left(unsigned i) { return i * 2 + 1; }
static unsigned right(unsigned i) { return i * 2 + 2; }
void bubble_down(unsigned int i) {
for (;;) {
unsigned j = i;
if (left(i) < size && nodes[left(i)] < nodes[i])
j = left(i);
if (right(i) < size && nodes[right(i)] < nodes[j])
j = right(i);
if (i != j) {
swap(nodes[i], nodes[j]);
i = j;
} else {
break;
}
}
}
void bubble_up(unsigned i) {
while (i > 0 && nodes[i] < nodes[parent(i)]) {
swap(nodes[parent(i)], nodes[i]);
i = parent(i);
}
}
public:
m_heap() {
for (unsigned i = 0; i < size; i++) {
nodes[i] = numeric_limits<T>::lowest(); // lowest(), not min(): min() is the smallest positive value for floating-point T
}
}
void add(const T& x) {
if (x < nodes[0]) {
// smaller than the current minimum of the kept M elements: reject outright
return;
}
// replace the minimum with the new element and restore the heap order
nodes[0] = x;
bubble_down(0);
}
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
using namespace std;
// INCLUDE TEMPLATED CLASS FROM ABOVE
typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }
int main()
{
int N = 2000;
vf v;
for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX);
static const int M = 50;
m_heap<M, float> h;
for (int i = 0; i < N; i++) h.add( v[i] );
sort(v.begin(), v.end(), compare);
vf heap(h.get(), h.get() + M); // assume public in m_heap: T* get() { return nodes; }
sort(heap.begin(), heap.end(), compare);
cout << "Real\tFake" << endl;
for (int i = 0; i < M; i++) {
cout << v[i] << "\t" << heap[i] << endl;
if (fabs(v[i] - heap[i]) > 1e-5) abort();
}
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
A quick optimization would be to add a sentinel value to your maxValues array. If maxValues[M].value is equal to std::numeric_limits<float>::max(), then you can eliminate the q < M test in your while loop condition.
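Sketched out (with M and the struct layout chosen just for illustration):
#include <limits>

struct SourcePoint { int x; int y; float value; };

constexpr int M = 15;                  // example value, not from the question
SourcePoint maxValues[M + 1];          // M real slots plus one sentinel slot

void initMaxValues()
{
    for (int q = 0; q < M; ++q)
        maxValues[q].value = -std::numeric_limits<float>::max();  // "empty" slots
    maxValues[M].value = std::numeric_limits<float>::max();       // sentinel, never exceeded by a sample
}

void insert(float sample, int i, int j)
{
    if (sample <= maxValues[0].value) return;   // same quick bailout as before
    int q = 1;
    while (sample > maxValues[q].value) {       // the sentinel stops the loop at q == M
        maxValues[q - 1] = maxValues[q];
        ++q;
    }
    maxValues[q - 1] = {i, j, sample};
}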
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try - if it works well enough, you don't need as much "magic". In particular, you don't have to resort to micro-optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
#include <string.h> // for memset
static const int M = 15;
static const int N = 20;
// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
Sample( float *arr, size_t row, size_t col )
: m_arr( arr ),
m_row( row ),
m_col( col )
{
}
inline operator float() const {
return m_arr[m_row * N + m_col];
}
bool operator<( const Sample &rhs ) const {
return (float)rhs < (float)*this;
}
int row() const {
return m_row;
}
int col() const {
return m_col;
}
private:
float *m_arr;
size_t m_row;
size_t m_col;
};
int main()
{
// Setup a demo array
float arr[N][N];
memset( arr, 0, sizeof( arr ) );
// Put in some sample values
arr[2][1] = 5.0;
arr[9][11] = 2.0;
arr[5][4] = 4.0;
arr[15][7] = 3.0;
arr[12][19] = 1.0;
// Setup the sequence of references into this array; you could keep
// a copy of this sequence around to reuse it later, I think.
std::vector<Sample> samples;
samples.reserve( N * N );
for ( size_t row = 0; row < N; ++row ) {
for ( size_t col = 0; col < N; ++col ) {
samples.push_back( Sample( (float *)arr, row, col ) );
}
}
// Let partial_sort find the M largest entries
std::partial_sort( samples.begin(), samples.begin() + M, samples.end() );
// Print out the row/column of the M largest entries.
for ( std::vector<Sample>::size_type i = 0; i < M; ++i ) {
std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
}
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
for (int j = 0; j < rows; j++) {
float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
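For what it's worth, a top-M maintenance step using those functions might look roughly like this (a sketch over plain floats; carrying the coordinates along works the same way with a small struct and a comparator):
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Maintain the M largest samples seen so far in `best`, organised as a
// min-heap via std::greater, so best.front() is the smallest kept value.
void keepTopM(std::vector<float>& best, std::size_t M, float sample)
{
    if (best.size() < M) {
        best.push_back(sample);
        std::push_heap(best.begin(), best.end(), std::greater<float>());
    } else if (sample > best.front()) {
        std::pop_heap(best.begin(), best.end(), std::greater<float>());   // move the minimum to the back
        best.back() = sample;                                             // overwrite it with the new sample
        std::push_heap(best.begin(), best.end(), std::greater<float>());  // restore the heap property
    }
}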
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float with double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double-value processing.
Use a linked list to store the best M values so far. You'll still have to iterate over it to find the right spot, but the insertion is O(1). It would probably even be better than binary search plus insertion: O(N)+O(1) vs. O(lg(N))+O(N).
Interchange the for loops, so you're not striding across memory and thrashing the cache.
LE: Throwing in another idea that might work for uniformly distributed values.
Find the min and max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix, place it in bucket[(int)((value - min) / range * (numBuckets - 1))], with range = max - min.
Finally, build a set starting from the highest bucket down to the lowest, adding elements from the next bucket while |current set| + |next bucket| <= M.
If you get M elements, you're done.
You'll likely get fewer elements than M, say P.
Apply your algorithm to the remaining boundary bucket and take the biggest M-P elements out of it.
If the elements are uniform and you use N^2 buckets, its complexity is about 3.5*(N^2), versus your current solution, which is about O(N^2)*ln(M). A rough sketch of the idea follows.
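Here is one possible sketch of that bucketing idea (assuming float samples flattened into one sequence, a non-empty input, and roughly uniform values; the function name is mine):
#include <algorithm>
#include <cstddef>
#include <vector>

// Bucket the values, then collect whole buckets from the top down until about
// M elements are gathered; only the boundary buckets need sorting.
std::vector<float> topMBucketed(const std::vector<float>& values, std::size_t M)
{
    auto [lo, hi] = std::minmax_element(values.begin(), values.end());
    float min = *lo;
    float range = *hi - *lo;
    std::size_t numBuckets = values.size();              // ~N^2 buckets for an NxN array
    std::vector<std::vector<float>> buckets(numBuckets);
    for (float v : values) {
        std::size_t b = range > 0.0f ? std::size_t((v - min) / range * (numBuckets - 1)) : 0;
        buckets[b].push_back(v);
    }
    std::vector<float> result;
    for (std::size_t b = numBuckets; b-- > 0 && result.size() < M; ) {
        std::sort(buckets[b].rbegin(), buckets[b].rend());   // sort this bucket descending
        for (float v : buckets[b])
            if (result.size() < M)
                result.push_back(v);
    }
    return result;                                           // the M largest values, descending
}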