I am working on a project where I implement two popular MST algorithms in C++ and then print how long each one takes to execute. Please ignore the actual algorithms; I have already tested them and am only interested in getting accurate measurements of how long they take.
void Graph::krushkalMST(bool e){
size_t s2 = size * size;
typedef struct{uint loc; uint val;} wcType; //struct used for storing a copy of the weights values to be sorted, with original locations
wcType* weightsCopy = new wcType[s2]; //copy of the weights which will be sorted.
for(int i = 0; i < s2; i++){
weightsCopy[i].loc = i;
weightsCopy[i].val = weights[i];
}
std::vector<uint> T(0); //List of edges in the MST
auto start = std::chrono::high_resolution_clock::now(); //time the program was started
typedef int (*cmpType)(const void*, const void*); //comparison function type
static cmpType cmp = [](const void* ua, const void* ub){ //Compare function used by the sort as a C++ lambda
uint a = ((wcType*)ua)->val, b = ((wcType*)ub)->val;
return (a == b) ? 0 : (a == NULLEDGE) ? 1 : (b == NULLEDGE) ? -1 : (a < b) ? -1 : 1;
};
std::qsort((void*)weightsCopy, s2, sizeof(wcType), cmp); //sort edges into ascending order with std::qsort (not necessarily an actual quicksort)
uint* componentRefs = new uint[size]; //maps nodes to what component they currently belong to
std::vector<std::vector<uint>> components(size); //vector of components, each component is a vector of nodes;
for(int i = 0; i < size; i++){
//unOptimize(components);
components[i] = std::vector<uint>({(uint)i});
componentRefs[i] = i;
}
for(int wcIndex = 0; components.size() >= 2 ; wcIndex++){
uint i = getI(weightsCopy[wcIndex].loc), j = getJ(weightsCopy[wcIndex].loc); //get pair of nodes with the smallest edge
uint ci = componentRefs[i], cj = componentRefs[j]; //locations of nodes i and j
if(ci != cj){
T.push_back(weightsCopy[wcIndex].loc); //push the edge into T
for(int k = 0; k < components[cj].size(); k++) //move each member in j's component to i's component
components[ci].push_back(components[cj][k]);
for(int k = 0; k < components[cj].size(); k++) //copy this change into the reference locations
componentRefs[components[cj][k]] = ci;
components.erase(components.begin() + cj); //delete j's component
for(int k = 0; k < size; k++)
if(componentRefs[k] >= cj)
componentRefs[k]--;
}
}
auto end = std::chrono::high_resolution_clock::now();
uint time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
std::cout<<"\nMST found by Kruskal's Algorithm:\n";
printData(time, T, e);
delete[] weightsCopy;
delete[] componentRefs;
}
void Graph::primMST(bool e){
std::vector<uint> T(0); //List of edges in the MST
auto start = std::chrono::high_resolution_clock::now(); //Start calculating the time the algorithm takes
bool* visited = new bool[size]; //Maps each node to a visited value
visited[0] = true;
for(int i = 1; i < size; i++)
visited[i] = false;
for(uint numVisited = 1; numVisited < size; numVisited++){
uint index = 0; //index of the smallest cost edge to unvisited node
uint minCost = std::numeric_limits<uint>::max(); //cost of the smallest edge filling those conditions
for(int i = 0; i < size; i++){
if(visited[i]){
for(int j = 0; j < size; j++){
if(!visited[j]){
uint curIndex = i * size + j, weight = dweights[curIndex];
if(weight != NULLEDGE && weight < minCost){
index = curIndex;
minCost = weight;
}
}
}
}
}
T.push_back(index);
visited[getI(index)] = true;
}
auto end = std::chrono::high_resolution_clock::now();
uint time = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count();
std::cout<<"\nMST found by Prim's Algorithm:\n";
printData(time, T, e);
delete[] visited;
}
I initially used clock() from <ctime> to try to get an accurate measurement of how long this would take. My largest test file has a graph of 40 nodes with 780 edges (large enough to warrant some compute time), and even then, on a slow computer using g++ with -O0, I would get either 0 or 1 milliseconds. On my desktop I was only ever able to get 0 ms. Since I need a more accurate way to distinguish time between test cases, I decided to try the high_resolution_clock provided by the <chrono> library.
This is where the real trouble began: I consistently got (and still get) a result of 0 nanoseconds for the program's execution.
In my search for a solution I came across multiple questions that deal with similar issues, most of which state that <chrono> is system-dependent and you're unlikely to actually get nanosecond or even microsecond values. Nevertheless, I tried using std::chrono::microseconds, only to still consistently get 0. Eventually I found what I thought was someone having the same problem as me:
counting duration with std::chrono gives 0 nanosecond when it should take long
However, that is clearly a problem of an overactive optimizer deleting an unnecessary piece of code, whereas in my case the end result always depends on the results of a series of complex loops which must be executed in full. I am on Windows 10, compiling with GCC using -O0.
My best hypothesis is that I'm doing something wrong, or that Windows doesn't support anything smaller than milliseconds while using std::chrono, and that std::chrono::nanoseconds values are actually just milliseconds padded with 0s on the end (as I observe when I put a system("pause") in the algorithm and unpause at arbitrary times). Please let me know if you find any way around this, or any other way I can achieve higher-resolution timing.
At the request of @Ulrich Eckhardt, I am including a minimal reproducible example as well as the results of the test I performed using it, and I must say it is rather insightful.
#include<iostream>
#include<chrono>
#include<cmath>
int main()
{
double c = 1;
for(int itter = 1; itter < 10000000; itter *= 10){
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < itter; i++)
c += sqrt(c) + log(c);
auto end = std::chrono::high_resolution_clock::now();
int time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
std::cout<<"calculated: "<<c<<". "<<itter<<" iterations took "<<time<<"ns\n";
}
system("pause");
}
For my loop I chose an arbitrary mathematical formula and made sure to use the result of what the loop does so it's not optimized out of existence. Testing it with various iteration counts on my desktop yields:
This seems to imply that a certain threshold is required before it starts counting time: dividing the first non-zero time by 10 should give another non-zero time (assuming the whole loop takes O(n) time for n iterations), but that is not what the results show. If anything, this small example baffles me even further.
Switch to steady_clock and you get the correct results for both MSVC and MinGW GCC.
You should avoid using high_resolution_clock, as it is just an alias for either steady_clock or system_clock. For measuring elapsed time in a stopwatch-like fashion, you always want steady_clock. high_resolution_clock is an unfortunate thing and should be avoided.
I just checked and MSVC has the following:
using high_resolution_clock = steady_clock;
while MinGW GCC has:
/**
* @brief Highest-resolution clock
*
* This is the clock "with the shortest tick period." Alias to
* std::system_clock until higher-than-nanosecond definitions
* become feasible.
*/
using high_resolution_clock = system_clock;
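For instance, here is the question's minimal reproducible example with only the clock swapped; a minimal sketch, but this one change is exactly what produces sensible timings under MinGW:
#include <iostream>
#include <chrono>
#include <cmath>
int main()
{
    double c = 1;
    for (int iter = 1; iter < 10000000; iter *= 10) {
        auto start = std::chrono::steady_clock::now(); // monotonic, stopwatch-style clock
        for (int i = 0; i < iter; i++)
            c += sqrt(c) + log(c);
        auto end = std::chrono::steady_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        std::cout << "calculated: " << c << ". " << iter << " iterations took " << time << "ns\n";
    }
}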
Related
What is the fastest way to access random (non-sequential) elements in an array if the access pattern is known beforehand? The access is random for different needs at every step, so rearranging the elements is an expensive option. The code below represents an important sample of the whole application.
#include <iostream>
#include "chrono"
#include <cstdlib>
#define NN 1000000
struct Astr{
double x[3], v[3];
int i, j, k;
long rank, p, q, r;
};
int main ()
{
struct Astr *key;
key = new Astr[NN];
int ii, *sequence;
sequence = new int[NN]; // access pattern is stored here
float frac ;
// create array of structs
// create array for random numbers between 0 to NN to access 'key'
for(int i=0; i < NN; i++){
key[i].x[1] = static_cast<double>(i);
key[i].p = static_cast<long>(i);
frac = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
sequence[i] = static_cast<int>(frac * static_cast<float>(NN));
}
// part to check and improve
// =========================================Random=======================================================
std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
double tmp;
long rnk;
for(int j=0; j < 1000; j++)
for(int i=0; i < NN; i++){
ii = sequence[i];
tmp = key[ii].x[1];
rnk = key[ii].p;
key[ii].x[1] = tmp * 1.01;
key[ii].p = rnk * 1.01;
}
std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
double time_uni = static_cast<double>(duration.count()) / 1000000;
std::cout << "\n Random array access " << time_uni << "s \n" ;
// ==========================================Sequential======================================================
TstartMain = std::chrono::high_resolution_clock::now();
for(int j=0; j < 1000; j++)
for(int i=0; i < NN; i++){
tmp = key[i].x[1];
rnk = key[i].p;
key[i].x[1] = tmp * 1.01;
key[i].p = rnk * 1.01;
}
TendMain = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
time_uni = static_cast<double>(duration.count()) / 1000000;
std::cout << " Sequential array access " << time_uni << "s \n" ;
// ================================================================================================
delete [] key;
delete [] sequence;
}
As expected, sequential access is faster; the results on my machine are:
Random array access 21.3763s
Sequential array access 8.7755s
The main question is whether random access could be made any faster.
The improvement could be in terms of the container itself (e.g. a list/vector rather than a plain array). Could software prefetching be implemented?
In theory it is possible to help guide the prefetcher to speed up random access (well, on those CPUs that support it, e.g. via _mm_prefetch on Intel/AMD). In practice, however, this is often a complete waste of time and will, more often than not, slow down your code.
The general theory is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two prior to using the value. There are, however, problems with this:
It is likely that you'll end up tuning the code for your CPU. When running that same code on other platforms, you'll probably find that different CPU cache layouts/sizes mean that your prefetch optimisations are now actually slowing the performance down.
The additional prefetch instructions will end up using up more of your instruction cache, and most likely your uop cache as well. You may find this alone slows the code down.
This assumes the CPU actually pays attention to the _mm_prefetch instruction. It is only a hint, so there are no guarantees it will be respected by the CPU.
If you want to speed up random memory access, there are better methods than prefetching imho.
Reduce the size of the data (i.e. use shorts/float16s in place of int/float, eradicate any erroneous padding in your structs, etc). By reducing the size of the structs, you have less memory to read, so it will go quicker! (Simple compression schemes aren't a bad idea either!) See the sketch after this list.
Sort your data so that instead of doing random access, you are processing the data sequentially.
Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access. (The only exception: if you are optimising code for a ~2001 Pentium 4, where prefetching was basically required.)
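To illustrate the first suggestion on the question's Astr struct: a hedged sketch, assuming the narrower types can actually hold your values (sizes quoted for a typical LP64 platform):
struct AstrSmall {
    float x[3], v[3]; // 24 bytes instead of 48
    int i, j, k;      // 12 bytes
    int rank, p, q, r; // 16 bytes instead of 32
}; // ~52 bytes per struct vs ~96, so roughly twice as many fit per cache line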
To give an example of what @robthebloke says, the following code makes a ~15% improvement on my machine:
#include <immintrin.h>
void do_it(struct Astr *key, const int *sequence) {
for(int i = 0; i < NN-8; ++i) {
_mm_prefetch(key + sequence[i+8], _MM_HINT_NTA);
struct Astr *ki = key+sequence[i];
ki->x[1] *= 1.01;
ki->p *= 1.01;
}
for(int i = NN-8; i < NN; ++i) {
struct Astr *ki = key+sequence[i];
ki->x[1] *= 1.01;
ki->p *= 1.01;
}
}
I am trying to solve this problem:
Given a string array words, find the maximum value of length(word[i]) * length(word[j]) where the two words do not share common letters. You may assume that each word will contain only lower case letters. If no such two words exist, return 0.
https://leetcode.com/problems/maximum-product-of-word-lengths/
You can create a bitmap of chars for each word to check whether they share characters in common, and then compute the max product.
I have two methods that are almost identical, but the first passes the checks while the second is too slow. Can you see why?
class Solution {
public:
int maxProduct2(vector<string>& words) {
int len = words.size();
int *num = new int[len];
// compute the bit O(n)
for (int i = 0; i < len; i ++) {
int k = 0;
for (int j = 0; j < words[i].length(); j ++) {
k = k | (1 << (words[i].at(j) - 'a')); // map 'a'..'z' to bits 0..25
}
num[i] = k;
}
int c = 0;
// O(n^2)
for (int i = 0; i < len - 1; i ++) {
for (int j = i + 1; j < len; j ++) {
if ((num[i] & num[j]) == 0) { // if no common letters
int x = words[i].length() * words[j].length();
if (x > c) {
c = x;
}
}
}
}
delete []num;
return c;
}
int maxProduct(vector<string>& words) {
vector<int> bitmap(words.size());
for(int i=0;i<words.size();++i) {
int k = 0;
for(int j=0;j<words[i].length();++j) {
k |= 1 << (words[i][j] - 'a'); // map 'a'..'z' to bits 0..25
}
bitmap[i] = k;
}
int maxProd = 0;
for(int i=0;i<words.size()-1;++i) {
for(int j=i+1;j<words.size();++j) {
if ( !(bitmap[i] & bitmap[j])) {
int x = words[i].length() * words[j].length();
if ( x > maxProd )
maxProd = x;
}
}
}
return maxProd;
}
};
Why is the second function (maxProduct) too slow for LeetCode?
Solution
The second method makes repeated calls to words.size(). If you save that in a variable, it works fine.
Since my comment turned out to be correct I'll turn my comment into an answer and try to explain what I think is happening.
I wrote some simple code to benchmark on my own machine, with two solutions of two loops each. The only difference is whether the call to words.size() is inside the loop or outside it. The first solution takes approximately 13.87 seconds versus 16.65 seconds for the second. That isn't huge, but it's about 20% slower.
Even though vector::size() is a constant-time operation, that doesn't mean it's as fast as checking against a variable that's already in a register. Constant time can still have large variances, and inside nested loops that adds up.
The other thing that could be happening (someone much smarter than me will probably chime in and let us know) is that you're hurting CPU optimizations like branching and pipelining. Every time execution gets to the end of the loop, it has to stop, wait for the call to size() to return, and then check the loop variable against that return value. If the CPU can look ahead and guess that j is still going to be less than len, because it hasn't seen len change (len isn't even modified inside the loop!), it can make a good branch prediction each time and not have to wait.
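To make that concrete, a minimal sketch of the fixed version, with size() hoisted into len and nothing else changed in spirit:
#include <string>
#include <vector>
using namespace std;

int maxProduct(vector<string>& words) {
    const int len = (int)words.size(); // read size() once, outside all loops
    vector<int> bitmap(len);
    for (int i = 0; i < len; ++i) {
        int k = 0;
        for (size_t j = 0; j < words[i].length(); ++j)
            k |= 1 << (words[i][j] - 'a'); // bits 0..25
        bitmap[i] = k;
    }
    int maxProd = 0;
    for (int i = 0; i < len - 1; ++i)
        for (int j = i + 1; j < len; ++j)
            if (!(bitmap[i] & bitmap[j])) { // no common letters
                int x = (int)(words[i].length() * words[j].length());
                if (x > maxProd)
                    maxProd = x;
            }
    return maxProd;
}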
Suppose you had a function that takes in a vector and a set of vectors and finds which vector in the set is closest to the original vector. It may be useful if I include some code:
int findBMU(float * inputVector, float * weights){
int count = 0;
float currentDistance = 0;
int winner = 0;
float leastDistance = 99999;
for(int i = 0; i<10; i++){
for(int j = 0;j<10; j++){
for(int k = 0; k<10; k++){
int offset = (i*100+j*10+k)*644;
for(int w = offset; w<offset+644; w++){ //'w' avoids shadowing the outer loop's 'i'
currentDistance += abs((inputVector[count]-weights[w]))*abs((inputVector[count]-weights[w]));
count++;
}
currentDistance = sqrt(currentDistance);
count = 0;
if(currentDistance<leastDistance){
winner = offset;
leastDistance = currentDistance;
}
currentDistance = 0;
}
}
}
return winner;
}
In this example, weights is a single dimensional array, with a block of 644 elements corresponding to one vector. inputVector is the vector that's being compared, and it also has 644 elements.
To speed up my program, I decided to take a look at the CUDA framework provided by NVIDIA. This is what my code looked like once I changed it to fit CUDA's specifications.
__global__ void findBMU(float * inputVector, float * weights, int * winner, float * leastDistance){
int i = threadIdx.x+(blockIdx.x*blockDim.x);
if(i<1000){
int offset = i*644;
int count = 0;
float currentDistance = 0;
for(int w = offset; w<offset+644; w++){
currentDistance += abs((inputVector[count]-weights[w]))*abs((inputVector[count]-weights[w]));
count++;
}
currentDistance = sqrt(currentDistance);
count = 0;
if(currentDistance<*leastDistance){
*winner = offset;
*leastDistance = currentDistance;
}
currentDistance = 0;
}
}
To call the function, I used: findBMU<<<20, 50>>>(d_data, d_weights, d_winner, d_least);
But, when I would call the function, sometimes it would give me the right answer, and sometimes it wouldn't. After doing some research, I found that CUDA has some issues with reduction problems like these, but I couldn't find how to fix it. How can I modify my program to make it work with CUDA?
The issue is that threads that run concurrently will see the same leastDistance and overwrite each other's results. There are two values that are shared between threads; leastDistance and winner. You have two basic options. You can write out the results from all the threads and then do a second pass over the data with a parallel reduction to determine which vector had the best match or you can implement this with a custom atomic operation using atomicCAS().
The first method is the easiest. My guess is that it will also give you the best performance, though it does add a dependency on the free Thrust library. You would use thrust::min_element().
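A hedged sketch of that first method (it assumes the kernel writes one distance per candidate into d_distances; thrust::min_element then does the reduction on a second pass):
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// Second pass: find the index of the smallest distance on the device.
void findWinner(thrust::device_vector<float>& d_distances, int& winner, float& leastDistance)
{
    thrust::device_vector<float>::iterator best =
        thrust::min_element(d_distances.begin(), d_distances.end());
    winner = (int)(best - d_distances.begin()) * 644; // offset of the closest vector
    leastDistance = *best; // one device-to-host copy
}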
The method using atomicCAS() uses the fact that atomicCAS() has a 64-bit mode, in which you can assign any semantics that you wish to a 64-bit value. In your case, you would use 32 bits to store leastDistance and 32 bits to store winner. To use this method, adapt this example in the CUDA C Programming Guide that implements a double precision floating point atomicAdd().
__device__ double atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull =
(unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
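Adapted to this problem, the packed update might look like the sketch below. It assumes distances are non-negative (true here, as they are square roots), so their IEEE-754 bit patterns compare in the same order as the float values; initialize the packed word to 0xFFFFFFFF00000000 before the launch.
// Pack leastDistance into the high 32 bits and winner's offset into the low
// 32 bits, so both are updated by a single atomic operation.
__device__ void atomicMinPair(unsigned long long* addr, float dist, unsigned offset)
{
    unsigned long long val = ((unsigned long long)(unsigned)__float_as_int(dist) << 32) | offset;
    unsigned long long old = *addr;
    while (val < old) { // a smaller distance packs to a smaller 64-bit value
        unsigned long long assumed = old;
        old = atomicCAS(addr, assumed, val); // try to install our pair
        if (old == assumed)
            break; // we won the race
    }
}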
I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.
So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all random), and iterates through the array branching based on whether the current element is 0 or 1, doing time-wasting work.
I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.
However, as the amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.
(X-axis is amount of time-wasting work, Y-axis is time-to-run)
Does anyone understand this behavior? You can see the code I'm running below:
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;
int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
int* zeroNums = new int[pipelineLen];
int* oneNums = new int[pipelineLen];
for(int i = 0; i < pipelineLen; ++i)
zeroNums[i] = oneNums[i] = 0;
chrono::time_point<chrono::system_clock> start, end;
start = chrono::system_clock::now();
for(int i = 0; i < s_iArrayLen; ++i){
if(vals[i] == 0){
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
else{
for(int i = 0; i < pipelineLen; ++i)
++oneNums[i];
}
}
end = chrono::system_clock::now();
int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();
//This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
for(int i = 0; i < pipelineLen - 1; ++i)
if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1])
return -1;
delete[] zeroNums;
delete[] oneNums;
return elapsedMicroseconds;
}
struct TestMethod{
string name;
void (*func)(int, int&);
int* results;
TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};
int main(){
srand( (unsigned int)time(nullptr) );
vector<TestMethod> testMethods;
testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));
int* vals = new int[s_iArrayLen];
for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
int resultsSum = 0;
for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
//Generate a new array...
for(int i = 0; i < s_iArrayLen; ++i)
testMethods[currentMethod].func(i, vals[i]);
//And record how long it takes
resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
}
testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
}
}
cout << "\t";
for(int i = 0; i < s_iMaxPipelineLen; ++i){
cout << i << "\t";
}
cout << "\n";
for (int i = 0; i < (int)testMethods.size(); ++i){
cout << testMethods[i].name.c_str() << "\t";
for(int j = 0; j < s_iMaxPipelineLen; ++j){
cout << testMethods[i].results[j] << "\t";
}
cout << "\n";
}
int end;
cin >> end;
delete[] vals;
}
Pastebin link: http://pastebin.com/F0JAu3uw
I think you may be measuring the cache/memory performance more than the branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, and so on.
I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.
Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.
Ok, a quick and dirty test I knocked up on my tea break which tried to mirror your own test method, but without thrashing the cache, looks like this:
Is that more what you expected?
If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...
Edit:
And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.
I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.
In addition to what JasonD pointed out, I would also like to note that there are conditions inside the for loop which may affect branch prediction:
if(vals[i] == 0)
{
for(int i = 0; i < pipelineLen; ++i)
++zeroNums[i];
}
i < pipelineLen; is a condition just like your ifs. Of course, the compiler may unroll this loop; however, pipelineLen is an argument passed to the function, so it probably does not.
I'm not sure if this can explain wavy pattern of your results, but:
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:
Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
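The article's example isn't reproduced in the excerpt above; applied to the question's inner loop, the idea is roughly this (a sketch, assuming pipelineLen is a multiple of 4):
// Unrolling by 4 takes the loop-back branch a quarter as often, so a loop
// needs 4x as many iterations before it overflows the 16-entry BTB history.
for (int i = 0; i < pipelineLen; i += 4) {
    ++zeroNums[i];
    ++zeroNums[i + 1];
    ++zeroNums[i + 2];
    ++zeroNums[i + 3];
}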
So your loops are not only measuring memory throughput, they are also affecting the BTB.
If you passed a 0-1 pattern in your list but then executed a for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap, so this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).
Take this as an example of what may happen rather than literal explanation. Your CPU may have much more sophisticated branch prediction architecture.
I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
right now I'm doing this:
struct SourcePoint {
Point point;
float value;
};
SourcePoint* maxValues = new SourcePoint[ M ];
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
if (sample > maxValues[0].value) {
int q = 1;
while ( q < M && sample > maxValues[q].value ) { // test the bound first so we never read maxValues[M]
maxValues[q-1] = maxValues[q]; // shuffle the values back
q++;
}
maxValues[q-1].value = sample;
maxValues[q-1].point = Point(i,j);
}
}
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encountered so far. This gives us a quick and easy bailout: if sample <= maxValues[0].value, we don't do anything. The issue I'm having is the shuffling every time a new better value is found. It works its way down maxValues until it finds its spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm, since it is called many, many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very little bit (not as much as I'd thought there would be).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Looking at how the data is created/stored/organized may give you patterns or information that reduce the time needed to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
int x;
int y;
float value;
int test; //Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, a simple first step would be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to a power of two in size, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
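A minimal sketch of that alternative, reusing the question's SourcePoint and Point types (considerSample is a hypothetical helper standing in for the loop body):
// maxValues stays unsorted; 'lowest' tracks the index of its smallest entry.
// The common case (sample rejected) is still one compare, and a replacement
// costs one O(M) rescan instead of an O(M) shuffle of the sorted array.
void considerSample(SourcePoint* maxValues, int M, int& lowest, float sample, int i, int j)
{
    if (sample <= maxValues[lowest].value)
        return; // quick bailout, as in the original
    maxValues[lowest].value = sample;
    maxValues[lowest].point = Point(i, j);
    lowest = 0; // rescan the M entries for the new minimum
    for (int q = 1; q < M; ++q)
        if (maxValues[q].value < maxValues[lowest].value)
            lowest = q;
}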
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think the OP's insertion sort probably has decent real-world performance (at least when the recommendations of the other posters in this thread are taken into account).
Look-up in the case of failure should be constant time: If the current element is less than the minimum element of the heap (containing the max M elements) we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
T nodes[size];
static const unsigned last = size - 1;
static unsigned parent(unsigned i) { return (i - 1) / 2; }
static unsigned left(unsigned i) { return i * 2 + 1; } //children of i are 2i+1 and 2i+2, consistent with parent()
static unsigned right(unsigned i) { return i * 2 + 2; }
void bubble_down(unsigned int i) {
for (;;) {
unsigned j = i;
if (left(i) < size && nodes[left(i)] < nodes[i])
j = left(i);
if (right(i) < size && nodes[right(i)] < nodes[j])
j = right(i);
if (i != j) {
swap(nodes[i], nodes[j]);
i = j;
} else {
break;
}
}
}
void bubble_up(unsigned i) {
while (i > 0 && nodes[i] < nodes[parent(i)]) {
swap(nodes[parent(i)], nodes[i]);
i = parent(i);
}
}
public:
m_heap() {
for (unsigned i = 0; i < size; i++) {
nodes[i] = numeric_limits<T>::lowest(); //lowest(), not min(): min() is the smallest positive value for floating-point types
}
}
void add(const T& x) {
if (x < nodes[0]) {
// reject outright
return;
}
nodes[0] = x; //replace the old minimum with the new element
bubble_down(0); //and sift it down to restore the heap property
}
T* get() { return nodes; } //expose the nodes for reading out the result
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
using namespace std;
// INCLUDE TEMPLATED CLASS FROM ABOVE
typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }
int main()
{
int N = 2000;
vf v;
for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX);
static const int M = 50;
m_heap<M, float> h;
for (int i = 0; i < N; i++) h.add( v[i] );
sort(v.begin(), v.end(), compare);
vf heap(h.get(), h.get() + M); // get() returns the heap's internal node array
sort(heap.begin(), heap.end(), compare);
cout << "Real\tFake" << endl;
for (int i = 0; i < M; i++) {
cout << v[i] << "\t" << heap[i] << endl;
if (fabs(v[i] - heap[i]) > 1e-5) abort();
}
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
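For example, a hedged sketch using std::priority_queue as a size-capped min-queue over the samples (the greater-than comparator keeps the smallest of the M best on top):
#include <queue>
#include <vector>

struct Entry { float value; int x, y; };
struct MinOnTop {
    bool operator()(const Entry& a, const Entry& b) const {
        return a.value > b.value; // min-queue: smallest value at top()
    }
};

std::vector<Entry> topM(const float* arr, int n, int M)
{
    std::priority_queue<Entry, std::vector<Entry>, MinOnTop> q;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float v = arr[i * n + j];
            if ((int)q.size() < M)
                q.push(Entry{v, i, j});
            else if (v > q.top().value) { // beats the current worst keeper
                q.pop();
                q.push(Entry{v, i, j});
            }
        }
    std::vector<Entry> out;
    while (!q.empty()) { out.push_back(q.top()); q.pop(); } // ascending by value
    return out;
}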
A quick optimization would be to add a sentinel value to your maxValues array. If you have maxValues[M].value equal to std::numeric_limits<float>::max(), then you can eliminate the q < M test in your while loop condition.
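A sketch of the setup, reusing the question's SourcePoint (only the extra slot and its initial values are new):
#include <limits>

// Allocate one extra slot; nothing can exceed the sentinel, so the insertion
// scan always stops there and needs no explicit q < M bounds check.
SourcePoint* makeMaxValues(int M)
{
    SourcePoint* maxValues = new SourcePoint[M + 1];
    for (int q = 0; q < M; ++q)
        maxValues[q].value = -std::numeric_limits<float>::max(); // real slots
    maxValues[M].value = std::numeric_limits<float>::max(); // sentinel
    return maxValues;
}
// The inner loop then becomes simply:
// while ( sample > maxValues[q].value ) { /* shuffle as before */ q++; }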
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try; if it works well enough, you don't have as much "magic". In particular, you don't have to resort to micro-optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
#include <string.h> // for memset
static const int M = 15;
static const int N = 20;
// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
Sample( float *arr, size_t row, size_t col )
: m_arr( arr ),
m_row( row ),
m_col( col )
{
}
inline operator float() const {
return m_arr[m_row * N + m_col];
}
bool operator<( const Sample &rhs ) const {
return (float)rhs < (float)*this; //reversed comparison: the largest sample sorts first
}
int row() const {
return m_row;
}
int col() const {
return m_col;
}
private:
float *m_arr;
size_t m_row;
size_t m_col;
};
int main()
{
// Setup a demo array
float arr[N][N];
memset( arr, 0, sizeof( arr ) );
// Put in some sample values
arr[2][1] = 5.0;
arr[9][11] = 2.0;
arr[5][4] = 4.0;
arr[15][7] = 3.0;
arr[12][19] = 1.0;
// Setup the sequence of references into this array; you could keep
// a copy of this sequence around to reuse it later, I think.
std::vector<Sample> samples;
samples.reserve( N * N );
for ( size_t row = 0; row < N; ++row ) {
for ( size_t col = 0; col < N; ++col ) {
samples.push_back( Sample( (float *)arr, row, col ) );
}
}
// Let partial_sort find the M largest entries
std::partial_sort( samples.begin(), samples.begin() + M, samples.end() );
// Print out the row/column of the M largest entries.
for ( std::vector<Sample>::size_type i = 0; i < M; ++i ) {
std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
}
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
for (int j = 0; j < rows; j++) {
float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
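A hedged sketch of that approach with the <algorithm> helpers (std::greater maintains a min-heap, so the smallest keeper sits at heap.front()):
#include <algorithm>
#include <functional>
#include <vector>

// Maintain a min-heap of the M best values seen so far.
void offer(std::vector<float>& heap, std::size_t M, float sample)
{
    if (heap.size() < M) {
        heap.push_back(sample);
        std::push_heap(heap.begin(), heap.end(), std::greater<float>());
    } else if (sample > heap.front()) { // beats the current minimum keeper
        std::pop_heap(heap.begin(), heap.end(), std::greater<float>()); // min to back
        heap.back() = sample;
        std::push_heap(heap.begin(), heap.end(), std::greater<float>());
    }
}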
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster; make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.
Use a linked list to store the best M values so far. You'll still have to iterate over it to find the right spot, but the insertion itself is O(1). It would probably even be better than binary search plus insertion: O(N)+O(1) vs. O(lg N)+O(N).
Interchange the fors, so you're not accessing every Nth element in memory and thrashing the cache.
LE: Throwing another idea that might work for uniformly distributed values.
Find the min, max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix, place it in bucket[(int)((value - min) / range * (numBuckets - 1))], where range = max - min.
Finally create a set starting from the highest bucket to the lowest, add elements from other buckets to it while |current set| + |next bucket| <=M.
If you get M elements you're done.
You'll likely get fewer elements than M; let's say P.
Apply your algorithm for the remaining bucket and get biggest M-P elements out of it.
If the elements are uniform and you use N^2 buckets, its complexity is about 3.5*(N^2), versus your current solution's roughly O(N^2)*ln(M).
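A hedged sketch of the bucket scheme, simplified to values only, with N^2 buckets and the final trim done by sorting just the few collected candidates:
#include <algorithm>
#include <functional>
#include <vector>

// Returns the M largest values of 'a' (the flattened NxN matrix); works best
// when the values are roughly uniformly distributed.
std::vector<float> topMByBuckets(const std::vector<float>& a, int M)
{
    float lo = *std::min_element(a.begin(), a.end());
    float hi = *std::max_element(a.begin(), a.end());
    float range = (hi > lo) ? (hi - lo) : 1.0f; // guard against all-equal input
    int B = (int)a.size(); // ~N^2 buckets
    std::vector<std::vector<float>> buckets(B);
    for (float v : a)
        buckets[(int)((v - lo) / range * (B - 1))].push_back(v);
    std::vector<float> top;
    for (int b = B - 1; b >= 0 && (int)top.size() < M; --b)
        top.insert(top.end(), buckets[b].begin(), buckets[b].end());
    std::sort(top.begin(), top.end(), std::greater<float>()); // trim the boundary bucket
    if ((int)top.size() > M)
        top.resize(M);
    return top;
}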