How can this combination algorithm be optimized? - c++

I am writing a molecular dynamics program that needs to take the atoms in a molecule and find the possible ways they can bond. To do this, I have a vector of Atom objects and I generate combination pairs using the following algorithm:
void CombinationKN(std::vector<std::vector<int>> &indices, int K, int N) {
    std::string bitmask(K, 1);
    bitmask.resize(N, 0);
    do {
        /* This loop takes forever with larger N values (approx. 3000) */
        std::vector<int> indexRow;
        for (int i = 0; i < N; i++)
        {
            if (bitmask[i]) indexRow.push_back(i);
        }
        indices.push_back(indexRow);
    } while (std::prev_permutation(bitmask.begin(), bitmask.end()));
}
It is a simple N choose K algorithm (i.e. the indices returned could contain (1, 2) but not (2, 1)) where in my case N is the number of atoms in the molecule and K is 2.
I then call the algorithm like this:
void CalculateBondGraph(const std::vector<Atom *> &atoms,
                        std::map<int, std::map<int, double>> &bondGraph,
                        ForceField *forceField) {
    int natoms = atoms.size();
    std::vector<std::vector<int>> indices;
    utils::CombinationKN(indices, 2, natoms);
    for (auto &v : indices) {
        int i = v[0];
        int j = v[1];
        /*... Check if atoms i and j are bonded based on their coordinates.*/
    }
}
The issue with this algorithm is that it takes forever to complete for large molecules with 3000+ atoms. I have thought about parallelizing it (specifically with OpenMP), but even then the work would have to be split among only a few threads, and it would still take a long time to complete. I need a way to optimize this algorithm so it doesn't take so long to compute combinations. Any help is appreciated.
Thank you,
Vikas

Your CombinationKN function is way more expensive than it needs to be, if K is much smaller than N -- and if N is large then of course K is much smaller than N or you will run out of memory very quickly. (For K=2, std::prev_permutation scans all N bits to produce each of the roughly N^2/2 combinations, so you are doing O(N^3) work for O(N^2) output.)
Notice that every valid index_row is a strictly monotonically increasing sequence of K integers less than N and vice-versa. It's easy enough to generate these directly:
void CombinationKN(std::vector<std::vector<int>> &indices, int K, int N) {
    std::vector<int> index_row;
    // lexicographically first valid row
    for (int i = 0; i < K; ++i) {
        index_row.push_back(i);
    }
    for (;;) {
        // output current row
        indices.push_back(index_row);
        // increment index_row to the lexically next valid sequence:
        // find the right-most index we can increment.
        // This loop does O(1) amortized iterations if K is not large, O(K) worst case
        int inc_index = K - 1;
        int index_limit = N - 1;
        while (inc_index >= 0 && index_row[inc_index] >= index_limit) {
            --inc_index;
            --index_limit;
        }
        if (inc_index < 0) {
            break; // all done
        }
        // generate the lexically first valid row with matching prefix and
        // larger value at inc_index
        int val = index_row[inc_index] + 1;
        for (; inc_index < K; ++inc_index, ++val) {
            index_row[inc_index] = val;
        }
    }
}
Also, if the only thing you're doing with these combinations is iterating through them, then there's no reason to waste the (possibly very large) amount of memory required to store the whole list of them. The above function contains a procedure for generating the next row from the previous one whenever you need it.
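Worth noting for the question's actual use case: with K = 2 you don't even need to generate rows at all. A plain nested loop (a sketch, reusing natoms and the bond check from CalculateBondGraph) visits every pair with i < j exactly once and materializes nothing:

for (int i = 0; i < natoms; ++i) {
    for (int j = i + 1; j < natoms; ++j) {
        /*... Check if atoms i and j are bonded based on their coordinates.*/
    }
}

That is N*(N-1)/2 iterations with no allocation at all.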

Related

Iterate through an array randomly but fast

I have a 2d array that I need to iterate through randomly. This is part of an update loop in a little simulation, so it runs about 200 times a second. Currently I achieve this by creating an array of the appropriate size, filling it with a range, and shuffling it to use to subscript my other arrays.
std::array<int, 250> nums;
std::iota(nums.begin(), nums.end(), 0);

timer += fElapsedTime;
if (timer >= 0.005f)
{
    std::shuffle(nums.begin(), nums.end(), engine);
    for (int xi : nums)
    {
        std::shuffle(nums.begin(), nums.end(), engine);
        for (int yi : nums)
        {
            // use xi and yi as array subscript and do stuff
        }
    }
    timer = 0.0f;
}
The issue with this solution is that it is really slow. Just removing the std::shuffles increases the fps by almost 2.5x, so the entire program logic is almost insignificant compared to these shuffles.
Is there some type of code that would allow me to generate a fixed range (0 - 249) of non-repeating randomly generated ints that I could either use directly or write to an array and then iterate over?
You should shuffle the entire matrix rather than going through one row/column at a time. This should work pretty fast: the whole array is 125 KB, which should be reasonably cache friendly.
constexpr int N = 250;
std::array<uint16_t, N * N> nums;
std::iota(nums.begin(), nums.end(), 0);
std::shuffle(nums.begin(), nums.end(), engine);

for (auto x : nums)
{
    auto xi = x / N;
    auto yi = x % N;
    // Do stuff indexed on xi and yi
}

Picking 6 random unique numbers

I have a problem trying to get this to work. I am meant to be picking 6 unique numbers between 1 and 49. I have a function doing this, but I am struggling to check the array for duplicates and replace them.
srand(static_cast<unsigned int>(time(NULL))); // Seeds a random number
int picked[6];
int number, i, j;
const int MAX_NUMBERS = 6;
for (i = 0; i < MAX_NUMBERS; i++)
{
    number = numberGen();
    for (int j = 0; j < MAX_NUMBERS; j++)
    {
        if (picked[i] == picked[j])
        {
            picked[j] = numberGen();
        }
    }
}
My number generator just creates a random number between 1 and 49, which I think works OK. I have just started with C++ and any help would be great.
int numberGen()
{
    int number = rand();
    int target = (number % 49) + 1;
    return target;
}
C++17 sample
C++17 provides an algorithm for exactly this (go figure):
std::sample
template< class PopulationIterator, class SampleIterator,
          class Distance, class UniformRandomBitGenerator >
SampleIterator sample( PopulationIterator first, PopulationIterator last,
                       SampleIterator out, Distance n,
                       UniformRandomBitGenerator&& g );
(since C++17)
Selects n elements from the sequence [first; last) such that each
possible sample has equal probability of appearance, and writes those
selected elements into the output iterator out. Random numbers are
generated using the random number generator g. [...]
constexpr int min_value = 1;
constexpr int max_value = 49;
constexpr int picked_size = 6;
constexpr int size = max_value - min_value + 1;

// fill array with [min_value, max_value] sequence
std::array<int, size> numbers{};
std::iota(numbers.begin(), numbers.end(), min_value);

// select 6 random numbers
std::array<int, picked_size> picked{};
std::sample(numbers.begin(), numbers.end(), picked.begin(), picked_size,
            std::mt19937{std::random_device{}()});
C++11 shuffle
If you can't use C++17 yet then the way to do this is to generate all the numbers in an array, shuffle the array and then pick the first 6 numbers in the array:
// fill array with [min_value, max_value] sequence
std::array<int, size> numbers{};
std::iota(numbers.begin(), numbers.end(), min_value);

// shuffle the array
std::random_device rd;
std::mt19937 e{rd()};
std::shuffle(numbers.begin(), numbers.end(), e);

// (optional) copy the picked ones:
std::array<int, picked_size> picked{};
std::copy(numbers.begin(), numbers.begin() + picked_size, picked.begin());
A side note: please use the new C++11 random library, and prefer std::array to bare C arrays. std::array doesn't decay to a pointer and provides begin, end, size, etc. methods.
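For example, the question's numberGen could be rewritten on top of <random> along these lines (a sketch; unlike rand() % 49 it is also free of modulo bias):

#include <random>

int numberGen()
{
    // one engine for the whole program, seeded once
    static std::mt19937 engine{std::random_device{}()};
    static std::uniform_int_distribution<int> dist{1, 49};
    return dist(engine);
}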
Let's break this code down.
for (i = 0; i < MAX_NUMBERS; i++)
We're doing a for-loop with 6 iterations.
number = numberGen();
We're generating a new number, and storing it into the variable number. This variable isn't used anywhere else.
for (int j = 0; j < MAX_NUMBERS; j++)
We're looping through the array again...
if (picked[i] == picked[j])
Checking to see if the two values match (fyi, picked[n] == picked[n] will always match)
picked[j] = numberGen();
And assigning a new random number to the existing value if they do match.
A better approach here would be to eliminate a duplicate value if one exists, then assign it to your array. For example:
for (i = 0; i < MAX_NUMBERS; i++)
{
    bool isDuplicate;
    do
    {
        isDuplicate = false; // reset each attempt, or one duplicate would loop forever
        number = numberGen(); // Generate the number

        // Check for duplicates
        for (int j = 0; j < MAX_NUMBERS; j++)
        {
            if (number == picked[j])
            {
                isDuplicate = true;
                break; // Duplicate detected
            }
        }
    }
    while (isDuplicate); // equivalent to while(isDuplicate == true)

    picked[i] = number;
}
Here, we run a do-while loop. Each iteration generates a random number and checks whether it is already in the array. If it is, the loop re-runs until a non-duplicate is found. Once the loop exits, we have a valid, non-duplicate number, and we assign it to the array.
There are going to be better solutions available as you progress through your course.
Efficient approach: Limited Fisher–Yates shuffle
For drawing n numbers from a pool of m, this approach needs n calls to random (6 in your case) instead of the m-1 (48 in your case) used when simply shuffling the whole array or vector. So the approach shown below is much more efficient than shuffling the whole array, and it does not require any duplicate checking.
Random numbers can get really expensive, so I thought it might be a good idea to never generate more random numbers than necessary. Simply running rand() repeatedly until a fitting number comes out seems like a bad idea.
Repeated draw-and-check-for-duplicates gets especially expensive in the case that nearly all of the available numbers need to be drawn.
I wanted to make it stateful, so it doesn't matter how many of the 49 numbers you actually request.
The solution below does not do any duplicate checking and calls rand() exactly n times for n random numbers. A slight modification of your numberGen was necessary for this. That said, you really should use the random library functions instead of rand().
The code below draws all numbers, just to verify that everything works fine, but it's easy to see how you would draw only 6 numbers :-)
If you need repeated draws you can simply add a reset() member function that sets drawn = 0 again. The vector is in a shuffled state then, but that doesn't do any harm.
If you can't afford the range checking in std::vector.at(), you can of course easily replace it with the index access operator[]. But I thought that for experimenting with the code, at() is the better choice, and this way you get error checking for the case that too many numbers are drawn.
Usage:
Create a class instance of n_out_of_m using the constructor, which takes the number of available numbers as an argument.
Call draw() repeatedly to draw numbers.
If you call draw() more often than there are numbers available, std::vector.at() will throw an out_of_range exception; if you don't like that, you need to add a check for that case.
I hope someone likes this approach.
#include <iostream>
#include <vector>
#include <algorithm>
#include <cstdlib>

size_t numberGen(size_t limit)
{
    size_t number = rand();
    size_t target = (number % limit) + 1;
    return target;
}

class n_out_of_m {
public:
    n_out_of_m(int m) { numbers.reserve(m); for (int i = 1; i <= m; ++i) numbers.push_back(i); }
    int draw();
private:
    std::vector<int> numbers;
    size_t drawn = 0;
};

int n_out_of_m::draw()
{
    size_t index = numberGen(numbers.size() - drawn) - 1;
    std::swap(numbers.at(index), numbers.at(numbers.size() - drawn - 1));
    drawn++;
    return numbers.at(numbers.size() - drawn);
}

int main(int argc, const char *argv[]) {
    n_out_of_m my_gen(49);
    for (int n = 0; n < 49; ++n)
        std::cout << n << "\t" << my_gen.draw() << "\n";
    return 0;
}

finding runtime function involving a while loop

I'm trying to find runtime functions and corresponding big-O notations for two different algorithms that both find spans for each element on a stack. The X passed in is the list that the span is to be computed from, and the S passed in is the list for the spans. I think I know how to find most of what goes into the runtime functions, and once I know that, I have a good understanding of how to get to big-O notation. What I need to understand is how to account for the while loops involved. I assumed they usually involve logarithms, but I can't see why that would apply here: I've been working through with the worst case being that each element is larger than the previous one, so the spans keep getting bigger, and I see no connection to logs. Here is what I have so far:
void span1(My_stack<int> X, My_stack<int> &S) {            // Algorithm 1
    int j = 0;                                             // +1
    for (int i = 0; i < X.size(); ++i) {                   // Find span for each index  // n
        j = 1;                                             // +1
        while ((j <= i) && (X.at(i-j) <= X.at(i))) {       // Check if span is larger  // ???
            ++j;                                           // 1
        }
        S.at(i) = j;                                       // +1
    }
}

void span2(My_stack<int> X, My_stack<int> &S) {            // Algorithm 2
    My_stack<int> A;                                       // empty stack  // +1
    for (int i = 0; i < (X.size()); ++i) {                 // Find span for each index  // n
        while (!A.empty() && (X.at(A.top()) <= X.at(i))) { // ???
            A.pop();                                       // 1
        }
        if (A.empty())                                     // +1
            S.at(i) = i + 1;
        else
            S.at(i) = i - A.top();
        A.push(i);                                         // +1
    }
}
span1: f(n) = 1+n(1+???+1)
span2: f(n) = 1+n(???+1+1)
Assuming all stack operations are O(1):
span1: The outer loop executes n times. The inner loop executes up to i times for each value of i from 0 to n. Hence the total time is proportional to the sum of the integers from 1 to n, i.e. O(n^2).
span2: We need to think about this differently, since the scope of A is function-wide. A starts out empty, so it can only be popped as many times as something is pushed onto it, i.e. the inner while loop can only execute as many times as A.push is called over the entirety of the function's execution. But A.push is only called once per outer-loop iteration, i.e. n times - so the while loop body can only execute n times in total. Hence the overall complexity is O(n).
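To make that argument concrete, here is a self-contained sketch of span2 with My_stack swapped for std::vector (my substitution; the logic is unchanged):

#include <vector>

std::vector<int> span2(const std::vector<int> &X) {
    std::vector<int> S(X.size());
    std::vector<int> A; // stack of indices not yet dominated by a later element
    for (int i = 0; i < (int)X.size(); ++i) {
        // Each index is pushed exactly once, so across the whole run this
        // while loop pops at most n times in total: O(n) amortized.
        while (!A.empty() && X[A.back()] <= X[i])
            A.pop_back();
        S[i] = A.empty() ? i + 1 : i - A.back();
        A.push_back(i);
    }
    return S;
}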

Improving O(n) while looping through a 2d array in C++

A goal of mine is to reduce my O(n^2) algorithms to O(n), as this kind of traversal is common in my Array2D class. Array2D holds a multidimensional array of type T. A common issue I see is using doubly-nested for loops to traverse the array, which is slow depending on the size.
As you can see, I reduced my doubly-nested for loops into a single for loop here. It runs fine when I execute it, and speed has surely improved. Is there any other way to improve the speed of this member function? I'm hoping to use this algorithm as a model for my other member functions that perform similar operations on multidimensional arrays.
/// <summary>
/// Fills all items within the array with a value.
/// </summary>
/// <param name="ob">The object to insert.</param>
void fill(const T &ob)
{
    if (m_array == NULL)
        return;

    //for (int y = 0; y < m_height; y++)
    //{
    //    for (int x = 0; x < m_width; x++)
    //    {
    //        get(x, y) = ob;
    //    }
    //}

    int size = m_width * m_height;
    int y = 0;
    int x = 0;
    for (int i = 0; i < size; i++)
    {
        get(x, y) = ob;
        x++;
        if (x >= m_width)
        {
            x = 0;
            y++;
        }
    }
}
Make sure things are contiguous in memory as cache behavior is likely to dominate the run-time of any code which performs only simple operations.
For instance, don't use this:
int* a[10];
for(int i=0;i<10;i++)
    a[i] = new int[10];

//Also not this
std::vector< std::vector<int> > a(10, std::vector<int>(10));
Use this:
int a[100];
//or
std::vector<int> a(100);
Now, if you need 2D access use:
for(int y=0;y<HEIGHT;y++)
    for(int x=0;x<WIDTH;x++)
        a[y*WIDTH+x];
Use 1D accesses for tight loops, whole-array operations which don't rely on knowledge of neighbours, or for situations where you need to store indices:
for(int i=0;i<HEIGHT*WIDTH;i++)
    a[i];
Note that in the above two loops the number of items touched is HEIGHT*WIDTH in both cases. Though it may appear that one has a time complexity of O(N^2) and the other O(N), it should be obvious that the net amount of work done is HEIGHT*WIDTH either way. It is better to think of N as the total number of items touched by an operation, rather than as a property of the way in which they are touched.
Sometimes you can compute Big O by counting loops, but not always.
for (int m = 0; m < M; m++)
{
    for (int n = 0; n < N; n++)
    {
        doStuff();
    }
}
Big O is a measure of "How many times is doStuff executed?" With the nested loops above it is executed MxN times.
If we flatten it to 1 dimension
for (int i = 0; i < M * N; i++)
{
    doStuff();
}
We now have one loop that executes MxN times. One loop. No improvement.
If we unroll the loop or play games with something like Duff's device
for (int i = 0; i < M * N; i += N)
{
    doStuff(); // 0
    doStuff(); // 1
    ....
    doStuff(); // N-1
}
We still have MxN calls to doStuff. Some days you just can't win with Big O. If you must call doStuff on every element in an array, no matter how many dimensions, you cannot reduce Big O. But if you can find a smarter algorithm that allows you to avoid calls to doStuff... That's what you are looking for.
For Big O, anyway. Sometimes you'll find stuff that has an as-bad-or-worse Big O yet it outperforms. One of the classic examples of this is std::vector vs std::list. Due to caching and prediction in a modern CPU, std::vector scores a victory that slavish obedience to Big O would miss.
Side note (because I regularly smurf this up myself): O(n) means that if you double n, you double the work. This is why O(n) is the same as O(1,000,000n). O(n^2) means that if you double n, you do 2^2 = 4 times the work. If you are ever puzzled by an algorithm, drop a counter into the operation you're concerned with and do a batch of test runs with various Ns, then check the relationship between the counters at those Ns.
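A sketch of that experiment (counter, doStuff, and the loop sizes are my stand-ins): print the counter for a few Ns and compare the ratios.

#include <cstdio>

static long long counter = 0;
static void doStuff() { ++counter; }

int main() {
    for (int n = 1000; n <= 8000; n *= 2) {
        counter = 0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                doStuff();
        // O(n^2): doubling n should roughly quadruple the count.
        std::printf("n=%d count=%lld\n", n, counter);
    }
    return 0;
}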

Optimized way to find M largest elements in an NxN array using C++

I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
right now I'm doing this:
struct SourcePoint {
    Point point;
    float value;
};

SourcePoint* maxValues = new SourcePoint[ M ];
maxCoefficients = new SourcePoint*[

for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
        if (sample > maxValues[0].value) {
            int q = 1;
            while ( q < M && sample > maxValues[q].value ) {
                maxValues[q-1] = maxValues[q]; // shuffle the values back
                q++;
            }
            maxValues[q-1].value = sample;
            maxValues[q-1].point = Point(i,j);
        }
    }
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encountered so far. This gives us a quick and easy bailout: if sample <= maxValues[0].value, we don't do anything. The issue I'm having is the shuffling every time a new better value is found. It works its way all the way down maxValues until it finds its spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm since this is called many many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very little bit (not as much as I'd thought there would be).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Look at how the data is created/stored/organized may give you patterns or information to reduce the amount of time to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
    int x;
    int y;
    float value;
    int test; // Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, a simple first step should be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to a power-of-two size, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think the OP's insertion sort probably has decent real-world performance (at least when the recommendations of the other posters in this thread are taken into account).
Look-up in the case of failure should be constant time: If the current element is less than the minimum element of the heap (containing the max M elements) we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
    T nodes[size];

    static unsigned parent(unsigned i) { return (i - 1) / 2; }
    static unsigned left(unsigned i)   { return i * 2 + 1; }
    static unsigned right(unsigned i)  { return i * 2 + 2; }

    void bubble_down(unsigned int i) {
        for (;;) {
            unsigned j = i;
            if (left(i) < size && nodes[left(i)] < nodes[i])
                j = left(i);
            if (right(i) < size && nodes[right(i)] < nodes[j])
                j = right(i);
            if (i != j) {
                swap(nodes[i], nodes[j]);
                i = j;
            } else {
                break;
            }
        }
    }

    void bubble_up(unsigned i) {
        while (i > 0 && nodes[i] < nodes[parent(i)]) {
            swap(nodes[parent(i)], nodes[i]);
            i = parent(i);
        }
    }

public:
    m_heap() {
        for (unsigned i = 0; i < size; i++) {
            // lowest(), not min(): for floating-point T, min() is the
            // smallest positive value, not the most negative one.
            nodes[i] = numeric_limits<T>::lowest();
        }
    }

    void add(const T& x) {
        if (x < nodes[0]) {
            // reject outright
            return;
        }
        nodes[0] = x;   // replace the current minimum...
        bubble_down(0); // ...and sift it down to restore heap order
    }

    T* get() { return nodes; } // used by the test below
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

using namespace std;

// INCLUDE TEMPLATED CLASS FROM ABOVE

typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }

int main()
{
    int N = 2000;
    vf v;
    for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX );

    static const int M = 50;
    m_heap<M, float> h;
    for (int i = 0; i < N; i++) h.add( v[i] );

    sort(v.begin(), v.end(), compare);
    vf heap(h.get(), h.get() + M); // assume public in m_heap: T* get() { return nodes; }
    sort(heap.begin(), heap.end(), compare);

    cout << "Real\tFake" << endl;
    for (int i = 0; i < M; i++) {
        cout << v[i] << "\t" << heap[i] << endl;
        if (fabs(v[i] - heap[i]) > 1e-5) abort();
    }
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
           class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
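As a rough single-threaded sketch of that idea (the largestM helper and the use of plain floats are my simplifications; for the question's data the element type would be SourcePoint with a Compare on value): keep a min-heap of size M so the weakest kept value is always on top.

#include <functional>
#include <queue>
#include <vector>

std::vector<float> largestM(const std::vector<float> &samples, size_t M) {
    // std::greater turns the default max-heap into a min-heap.
    std::priority_queue<float, std::vector<float>, std::greater<float>> q;
    for (float s : samples) {
        if (q.size() < M)
            q.push(s);
        else if (s > q.top()) { // better than the worst kept value
            q.pop();
            q.push(s);
        }
    }
    std::vector<float> out;
    for (; !q.empty(); q.pop())
        out.push_back(q.top()); // comes out in ascending order
    return out;
}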
A quick optimization would be to add a sentinel value to your maxValues array. If you have maxValues[M].value equal to std::numeric_limits<float>::max(), then you can eliminate the q < M test in your while loop condition.
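Sketched against the loop from the question (sample, Point(i,j), and the surrounding scan are assumed from there), the change is one extra slot plus the shortened condition:

#include <limits>

SourcePoint *maxValues = new SourcePoint[M + 1]; // one extra slot for the sentinel
// ... fill maxValues[0..M-1] with the initial lows as before, then:
maxValues[M].value = std::numeric_limits<float>::max();

// Inner loop from the question, minus the q < M test:
if (sample > maxValues[0].value) {
    int q = 1;
    while (sample > maxValues[q].value) { // the sentinel stops the walk
        maxValues[q-1] = maxValues[q];    // shuffle the values back
        q++;
    }
    maxValues[q-1].value = sample;
    maxValues[q-1].point = Point(i,j);
}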
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try - if it works well enough, you don't need as much "magic". In particular, you don't have to resort to micro-optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
#include <string.h> // for memset

static const int M = 15;
static const int N = 20;

// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
    Sample( float *arr, size_t row, size_t col )
        : m_arr( arr ),
          m_row( row ),
          m_col( col )
    {
    }

    inline operator float() const {
        return m_arr[m_row * N + m_col];
    }

    bool operator<( const Sample &rhs ) const {
        // reversed comparison so partial_sort puts the largest values first
        return (float)rhs < (float)*this;
    }

    int row() const {
        return m_row;
    }

    int col() const {
        return m_col;
    }

private:
    float *m_arr;
    size_t m_row;
    size_t m_col;
};
int main()
{
    // Setup a demo array
    float arr[N][N];
    memset( arr, 0, sizeof( arr ) );

    // Put in some sample values
    arr[2][1] = 5.0;
    arr[9][11] = 2.0;
    arr[5][4] = 4.0;
    arr[15][7] = 3.0;
    arr[12][19] = 1.0;

    // Setup the sequence of references into this array; you could keep
    // a copy of this sequence around to reuse it later, I think.
    std::vector<Sample> samples;
    samples.reserve( N * N );
    for ( size_t row = 0; row < N; ++row ) {
        for ( size_t col = 0; col < N; ++col ) {
            samples.push_back( Sample( (float *)arr, row, col ) );
        }
    }

    // Let partial_sort find the M largest entries
    std::partial_sort( samples.begin(), samples.begin() + M, samples.end() );

    // Print out the row/column of the M largest entries.
    for ( std::vector<Sample>::size_type i = 0; i < M; ++i ) {
        std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
    }
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
    for (int j = 0; j < rows; j++) {
        float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
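A sketch of that variant with the standard algorithms (the consider wrapper is my own framing): std::greater keeps the vector in min-heap order, so the weakest kept value sits at the front.

#include <algorithm>
#include <functional>
#include <vector>

std::vector<float> best; // maintained in min-heap order, at most M elements

void consider(float sample, size_t M) {
    if (best.size() < M) {
        best.push_back(sample);
        std::push_heap(best.begin(), best.end(), std::greater<float>());
    } else if (sample > best.front()) {
        // Move the current minimum to the back, overwrite it, re-heapify.
        std::pop_heap(best.begin(), best.end(), std::greater<float>());
        best.back() = sample;
        std::push_heap(best.begin(), best.end(), std::greater<float>());
    }
}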
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.
Use a linked list to store the best M values so far. You'll still have to iterate over it to find the right spot, but the insertion is O(1). It would probably even be better than binary search plus insertion: O(N)+O(1) vs O(lg(N))+O(N).
Interchange the fors, so you're not accessing every Nth element in memory and thrashing the cache.
LE: Throwing in another idea that might work for uniformly distributed values.
Find the min and max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly spaced buckets, preferably closer to N^2 than N.
For every element in the NxN matrix, place it in bucket[(int)((value-min)/range * (nbuckets-1))], where range = max-min and nbuckets is the number of buckets.
Finally, create a set starting from the highest bucket down to the lowest, adding whole buckets to it while |current set| + |next bucket| <= M.
If you get exactly M elements, you're done.
More likely you'll get fewer elements than M, let's say P. Apply your current algorithm to the remaining bucket and take the biggest M-P elements out of it.
If the elements are uniform and you use N^2 buckets, the complexity is about 3.5*(N^2), vs your current solution which is about O(N^2)*ln(M).
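For what it's worth, here is a rough sketch of that bucket scheme on a flattened array (the helper name bucketTopM, the N^2 bucket count, and the assumption of a non-empty input with M <= a.size() are mine):

#include <algorithm>
#include <functional>
#include <vector>

std::vector<float> bucketTopM(const std::vector<float> &a, size_t M) {
    auto mm = std::minmax_element(a.begin(), a.end()); // requires non-empty input
    float mn = *mm.first, range = *mm.second - *mm.first;

    size_t nbuckets = a.size(); // for an NxN matrix flattened, this is N^2
    std::vector<std::vector<float>> buckets(nbuckets);
    for (float v : a) {
        size_t b = range > 0 ? (size_t)((v - mn) / range * (nbuckets - 1)) : 0;
        buckets[b].push_back(v);
    }

    // Walk from the highest bucket down, sorting each bucket we touch so
    // the output comes out in descending order; buckets above the cutoff
    // are consumed whole, and only the last one is taken partially.
    std::vector<float> out;
    for (size_t b = nbuckets; b-- > 0 && out.size() < M; ) {
        std::sort(buckets[b].begin(), buckets[b].end(), std::greater<float>());
        for (float v : buckets[b])
            if (out.size() < M)
                out.push_back(v);
    }
    return out;
}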