A goal of mine is to reduce my O(n^2) algorithms to O(n), since this pattern comes up again and again in my Array2D class. Array2D holds a two-dimensional array of type T. A common issue I see is using doubly-nested for loops to traverse the array, which can be slow depending on its size.
As you can see below, I reduced the doubly-nested for loops to a single for loop. It runs fine when I execute it, and speed seems to have improved. Is there any other way to improve the speed of this member function? I'm hoping to use it as a model for my other member functions that perform similar operations on two-dimensional arrays.
/// <summary>
/// Fills all items within the array with a value.
/// </summary>
/// <param name="ob">The object to insert.</param>
void fill(const T &ob)
{
    if (m_array == NULL)
        return;

    //for (int y = 0; y < m_height; y++)
    //{
    //    for (int x = 0; x < m_width; x++)
    //    {
    //        get(x, y) = ob;
    //    }
    //}

    int size = m_width * m_height;
    int y = 0;
    int x = 0;
    for (int i = 0; i < size; i++)
    {
        get(x, y) = ob;
        x++;
        if (x >= m_width)
        {
            x = 0;
            y++;
        }
    }
}
Make sure things are contiguous in memory as cache behavior is likely to dominate the run-time of any code which performs only simple operations.
For instance, don't use this:
int* a[10];
for (int i = 0; i < 10; i++)
    a[i] = new int[10];
// Also not this
std::vector<std::vector<int>> a(10, std::vector<int>(10));
Use this:
int a[100];
//or
std::vector<int> a(100);
Now, if you need 2D access use:
for (int y = 0; y < HEIGHT; y++)
    for (int x = 0; x < WIDTH; x++)
        a[y*WIDTH + x];
Use 1D accesses for tight loops, whole-array operations which don't rely on knowledge of neighbours, or for situations where you need to store indices:
for (int i = 0; i < HEIGHT*WIDTH; i++)
    a[i];
Note that in the above two loops the number of items touched is HEIGHT*WIDTH in both cases. Though it may appear that one has a time complexity of O(n^2) and the other O(n), it should be obvious that the net amount of work done is HEIGHT*WIDTH either way. It is better to think of n as the total number of items touched by an operation, rather than as a property of the way in which they are touched.
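To tie this back to the original Array2D question, here is a minimal sketch of a contiguous 2D container; the class and member names are assumptions, not the real Array2D internals. With one flat buffer, fill becomes a single pass and can simply delegate to std::fill.
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal contiguous 2D container; names are placeholders, not the real Array2D.
template <typename T>
class Grid {
public:
    Grid(int width, int height)
        : m_width(width), m_height(height),
          m_data(static_cast<std::size_t>(width) * height) {}

    // One pass over one flat buffer; std::fill is typically well optimized.
    void fill(const T& ob) { std::fill(m_data.begin(), m_data.end(), ob); }

    // Row-major mapping: the 2D coordinates collapse to a single index.
    T& get(int x, int y) { return m_data[static_cast<std::size_t>(y) * m_width + x]; }

private:
    int m_width;
    int m_height;
    std::vector<T> m_data;
};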
Sometimes you can compute Big O by counting loops, but not always.
for (int m = 0; m < M; m++)
{
    for (int n = 0; n < N; n++)
    {
        doStuff();
    }
}
Big O is a measure of "How many times is doStuff executed?" With the nested loops above it is executed MxN times.
If we flatten it to 1 dimension
for (int i = 0; i < M * N; i++)
{
    doStuff();
}
We now have one loop that executes MxN times. One loop. No improvement.
If we unroll the loop or play games with something like Duff's device
for (int i = 0; i < M * N; i += N)
{
    doStuff(); // 0
    doStuff(); // 1
    ....
    doStuff(); // N-1
}
We still have MxN calls to doStuff. Some days you just can't win with Big O. If you must call doStuff on every element in an array, no matter how many dimensions, you cannot reduce Big O. But if you can find a smarter algorithm that allows you to avoid calls to doStuff... That's what you are looking for.
For Big O, anyway. Sometimes you'll find stuff that has an as-bad-or-worse Big O yet it outperforms. One of the classic examples of this is std::vector vs std::list. Due to caching and prediction in a modern CPU, std::vector scores a victory that slavish obedience to Big O would miss.
Side note (Because I regularly smurf this up myself): O(n) means if you double n, you double the work. This is why O(n) is the same as O(1,000,000 n). O(n^2) means if you double n you do 2^2 = 4 times the work. If you are ever puzzled by an algorithm, drop a counter into the operation you're concerned with and do a batch of test runs with various Ns. Then check the relationship between the counters at those Ns.
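For example, here is a small hypothetical counting harness along those lines; doStuff is just a stand-in for whatever operation you are measuring.
#include <cstdio>

// Hypothetical harness: count how often the inner operation runs for a few
// values of n, then compare the counts to guess the growth rate.
static long long counter = 0;

void doStuff() { ++counter; }     // stand-in for the real work

void run(int n) {
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            doStuff();
}

int main() {
    long long prev = 0;
    for (int n = 100; n <= 800; n *= 2) {
        counter = 0;
        run(n);
        // Doubling n multiplies the count by ~4 here, i.e. quadratic in the
        // side length (but still linear in the number of items touched).
        std::printf("n=%d count=%lld ratio=%.2f\n", n, counter,
                    prev ? static_cast<double>(counter) / prev : 0.0);
        prev = counter;
    }
}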
I have the following piece of C++ code. The scale of the problem is N and M. Running the code takes about two minutes on my machine (after g++ -O3 compilation). Is there any way to accelerate it further on the same machine? Any kind of option is on the table: a better data structure, a library, the GPU, parallelism, etc.
#include <vector>
using std::vector;

void demo() {
    int N = 1000000;
    int M = 3000;
    vector<vector<int>> res(M);
    for (int i = 0; i < N; i++) {
        for (int j = 1; j < M; j++) {
            res[j].push_back(i);
        }
    }
}

int main() {
    demo();
    return 0;
}
Additional info: the second loop above, for (int j = 1; j < M; j++), is a simplified version of the real problem. In fact, j could be in a different range for each i of the outer loop, but the number of iterations is about 3000.
With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call iota to initialize it. Then just pass this vector along to the outer vector constructor to use it for each element.
Then you don't need any explicit loops at all, and instead use the (highly optimized, hopefully) standard library to do all the work for you.
Perhaps something like this:
#include <numeric>
#include <vector>

void demo()
{
    static int const N = 1000000;
    static int const M = 3000;

    std::vector<int> data(N);
    std::iota(begin(data), end(data), 0);

    std::vector<std::vector<int>> res(M, data);
}
Alternatively, you could initialize just one vector with those elements, and then create the other vectors by copying that memory using std::memcpy or std::copy.
Another optimization would be to allocate the memory in advance (e.g. calling reserve(N) on each inner vector before pushing into it).
Also, if you're sure all the vectors will hold the same elements, you could do a hack: create the million-element vector just once, and have res store 3000 references (or pointers) to that single vector.
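As a rough sketch of that last idea (assuming the rows never need to be modified independently), you could share one row through std::shared_ptr:
#include <memory>
#include <numeric>
#include <vector>

// Build the 0..N-1 row once and let every slot of res share it. This only
// works if callers never modify the rows independently.
void demo_shared() {
    const int N = 1000000;
    const int M = 3000;
    auto row = std::make_shared<std::vector<int>>(N);
    std::iota(row->begin(), row->end(), 0);
    std::vector<std::shared_ptr<const std::vector<int>>> res(M, row);
    // Every res[j] sees N elements, but only one N-element buffer exists.
}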
On my machine, which has enough memory to avoid swapping, your original code took 86 seconds.
Adding reserve:
for (auto& v : res)
{
    v.reserve(N);
}
made basically no difference (85 seconds but I only ran each version once).
Swapping the loop order:
for (int j = 1; j < M; j++) {
    for (int i = 0; i < N; i++) {
        res[j].push_back(i);
    }
}
reduced the time to 10 seconds. This is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache locality by accessing memory in sequential order.
Creating one vector and copying it into the others:
for (int i = 0; i < N; i++) {
res[1].push_back(i);
}
for (int j = 2; j < M; j++) {
res[j] = res[1];
}
reduced the time to 4 seconds.
Using a single vector:
void demo() {
    size_t N = 1000000;
    size_t M = 3000;
    vector<int> res(M * N);
    size_t offset = N;
    for (size_t i = 0; i < N; i++) {
        res[offset++] = i;
    }
    for (size_t j = 2; j < M; j++) {
        std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
        offset += N;
    }
}
also took 4 seconds. There probably isn't much improvement because you have 3,000 vectors of 4 MB each; there would likely be more of a difference if N were smaller or M were larger.
I am writing a molecular dynamics program that needs to take the atoms in a molecule and find the possible ways they can bond. To do this, I have a vector of Atom objects and I generate combination pairs using the following algorithm:
void CombinationKN(std::vector<std::vector<int>> &indices, int K, int N) {
    std::string bitmask(K, 1);
    bitmask.resize(N, 0);
    do {
        /* This loop takes forever with larger N values (approx. 3000) */
        std::vector<int> indexRow;
        for (int i = 0; i < N; i++)
        {
            if (bitmask[i]) indexRow.push_back(i);
        }
        indices.push_back(indexRow);
    } while (std::prev_permutation(bitmask.begin(), bitmask.end()));
}
It is a simple N choose K algorithm (i.e. the indices returned could contain (1, 2) but not (2, 1)) where in my case N is the number of atoms in the molecule and K is 2.
I then call the algorithm like this:
void CalculateBondGraph(const std::vector<Atom *> &atoms,
                        std::map<int, std::map<int, double>> &bondGraph,
                        ForceField *forceField) {
    int natoms = atoms.size();
    std::vector<std::vector<int>> indices;
    utils::CombinationKN(indices, 2, natoms);
    for (auto &v : indices) {
        int i = v[0];
        int j = v[1];
        /*... Check if atoms i and j are bonded based on their coordinates. */
    }
}
The issue with this algorithm is that it takes forever to complete for large molecules that have 3000+ atoms. I have thought about parallelizing it (specifically with OpenMP), but even then, the work would have to be split among a few threads and it would still take a lot of time to complete. I need a way to optimize this algorithm so it doesn't take so long to compute combinations. Any help is appreciated.
Thank you,
Vikas
Your CombinationKN function is way more expensive than it needs to be, if K is much smaller than N -- and if N is large then of course K is much smaller than N or you will run out of memory very quickly.
Notice that every valid index_row is a strictly monotonically increasing sequence of K integers less than N and vice-versa. It's easy enough to generate these directly:
void CombinationKN(std::vector<std::vector<int>> &indices, int K, int N) {
    std::vector<int> index_row;
    // lexicographically first valid row
    for (int i = 0; i < K; ++i) {
        index_row.push_back(i);
    }
    for (;;) {
        // output current row
        indices.push_back(index_row);

        // increment index_row to the lexicographically next valid sequence:
        // find the right-most index we can increment.
        // This loop does O(1) amortized iterations if K is not large, O(K) worst case.
        int inc_index = K - 1;
        int index_limit = N - 1;
        while (inc_index >= 0 && index_row[inc_index] >= index_limit) {
            --inc_index;
            --index_limit;
        }
        if (inc_index < 0) {
            break; // all done
        }

        // generate the lexicographically first valid row with matching prefix and
        // larger value at inc_index
        int val = index_row[inc_index] + 1;
        for (; inc_index < K; ++inc_index, ++val) {
            index_row[inc_index] = val;
        }
    }
}
Also, if the only thing you're doing with these combinations is iterating through them, then there's no reason to waste the (possibly very large amount of) memory required to store the whole list of them. The above function contains a procedure for generating the next one from the previous one when you need it.
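For the K = 2 case in particular, you don't even need a generator or the indices vector: just visit each pair (i, j) with i < j directly. Here is a minimal sketch; the visitor name and signature are mine, and the callable is a stand-in for the coordinate-based bond check from the question.
#include <cstddef>

// For K == 2 there is no need to materialize the index list at all: visit
// each unordered pair (i, j) with i < j directly.
template <typename Visitor>
void ForEachPair(std::size_t n, Visitor visit) {
    for (std::size_t i = 0; i + 1 < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            visit(i, j);
}

// Usage sketch (atoms comes from the caller, as in the question):
//   ForEachPair(atoms.size(), [&](std::size_t i, std::size_t j) {
//       // check whether atoms[i] and atoms[j] are bonded
//   });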
NB: I'm open to suggestions for a better title.
Imagine an nxn square, stored as an integer array.
What is the most efficient method of generating an n-length array of the integers in each of the n non-overlapping sqrt(n)xsqrt(n) sub-squares?
A special case (n=9) of this is Sudoku, if we wanted the numbers in the smaller squares.
The only method I can think of is something like:
int square[n][n], subsq[n], len;
int s = sqrt(n);

for (int j = 0; j < n; j += s) {
    for (int i = 0; i < n; i += s) {
        // square[i][j] is the top-left of each sub-square
        len = 0;
        for (int y = j; y < j + s; y++) {
            for (int x = i; x < i + s; x++) {
                subsq[len] = square[x][y];
                len++;
            }
        }
    }
}
But this seems loopy, if you'll forgive me the pun.
Does anyone have a more efficient suggestion?
Despite the four-level loop, you access each array element at most once, so the complexity of your approach is O(n^2), not the O(n^4) the four loop levels might suggest. And since you actually want to look at all elements, this is close to optimal.
There is only one possible suboptimality: incomplete use of cache lines. If a sub-square row of s elements does not cover whole cache lines, your sub-squares will end in the middle of a cache line, leading to parts of the data being fetched twice from memory. However, this is only an issue if your sub-squares no longer fit into cache, so you need a very large problem size to trigger it. For a Sudoku square, there is no faster way than the one you've given.
To work around this cache-line issue (once you have determined that it is really worth it!), you can go through your matrix one line at a time, aggregating data for ceil(n/s) = sqrt(n) sub-squares in an output array. This would exchange the loops in the following way:
for (int j = 0; j < n; j += s) {
    for (int y = j; y < j + s; y++) {
        for (int i = 0; i < n; i += s) {
            for (int x = i; x < i + s; x++) {
However, this will only work out if the intermediate data you need to hold while traversing a single subsquare is small. If you need to copy the entire data into a temporary array like you do, you won't gain anything.
If you really want to optimize, try to get away from storing the data in the temporary subsq array. Try to interpret the data directly where you read it from the matrix. If you are indeed checking Sudoku squares, it is possible to avoid this temporary array, as in the sketch below.
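As a rough illustration (assuming the per-sub-square work really is a Sudoku-style duplicate check, which the question only hints at), a bitmask lets you validate a sub-square directly from the matrix with no temporary array; the hard-coded 9 matches the n = 9 special case.
// Assumed Sudoku-style check for the n = 9 case: each value 1..9 must occur
// at most once in the s x s sub-square whose top-left corner is (top, left).
// Reads straight from the matrix; no temporary subsq[] copy is needed.
bool subsquareIsValid(const int (*square)[9], int top, int left, int s) {
    unsigned seen = 0;                     // bit v set <=> value v already seen
    for (int y = top; y < top + s; ++y) {
        for (int x = left; x < left + s; ++x) {
            unsigned bit = 1u << square[y][x];
            if (seen & bit)
                return false;              // duplicate value in this sub-square
            seen |= bit;
        }
    }
    return true;
}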
From the way you pose the question, I presume that your goal is to pass the data in each subsquare to an analysis function in turn. If that is the case, you can simply pass a pointer to the 2D subarray to the function like this:
void analyse(int width, int height, int (*subsquare)[n]) {
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            subsquare[y][x]; // do anything you like with this value
        }
    }
}

int main() {
    int square[n][n];
    int s = sqrt(n);

    for (int j = 0; j < n; j += s) {
        for (int i = 0; i < n; i += s) {
            analyse(s, s, (int (*)[n])&square[i][j]);
        }
    }
}
Now you can just pass any 2D subarray shape to your analysis function by varying the first two parameters, and completely avoid a copy.
I have a vector of numbers between 1 and 100 (this is not important), which can have anywhere from 3 to 1,000,000 values.
Can anyone help me get 3-value unique* combinations from that vector?
*Unique
Example: I have the following values in the array: 1[0] 5[1] 7[2] 8[3] 7[4] (the [x] is the index)
In this case 1[0] 5[1] 7[2] and 1[0] 5[1] 7[4] are different, but 1[0] 5[1] 7[2] and 7[2] 1[0] 5[1] are the same (a duplicate).
My algorithm is a little slow when I work with a lot of values (for example 1,000,000), so what I want is a faster way to do it.
for (unsigned int x = 0; x < vect.size() - 2; x++) {
    for (unsigned int y = x + 1; y < vect.size() - 1; y++) {
        for (unsigned int z = y + 1; z < vect.size(); z++)
        {
            // do things with vect[x], vect[y], vect[z]
        }
    }
}
In fact it is very very important that your values are between 1 and 100! Because with a vector of size 1,000,000 you have a lot of numbers that are equal and you don't need to inspect all of them! What you can do is the following:
Note: the following code is just an outline! It may lack sufficient error checking and is just here to give you the idea, not for copy paste!
Note2: When I wrote the answer, I assumed the numbers to be in the range [0, 99]. Then I read that they are actually in [1, 100]. Obviously this is not a problem and you can either -1 all the numbers or even better, change all the 100s to 101s.
bool exists[100] = {0}; // exists[i] means whether i exists in your vector
for (unsigned int i = 0, size = vect.size(); i < size; ++i)
exists[vect[i]] = true;
Then, you do similar to what you did before:
for (unsigned int x = 0; x < 98; x++)
    if (exists[x])
        for (unsigned int y = x + 1; y < 99; y++)
            if (exists[y])
                for (unsigned int z = y + 1; z < 100; z++)
                    if (exists[z])
                    {
                        // {x, y, z} is an answer
                    }
Another thing you can do is spend more time in preparation to have less time generating the pairs. For example:
int nums[100]; // from 0 to count are the numbers you have
int count = 0;

for (unsigned int i = 0, size = vect.size(); i < size; ++i)
{
    bool exists = false;
    for (int j = 0; j < count; ++j)
        if (vect[i] == nums[j])
        {
            exists = true;
            break;
        }

    if (!exists)
        nums[count++] = vect[i];
}
Then
for (unsigned int x = 0; x < count - 2; x++)
    for (unsigned int y = x + 1; y < count - 1; y++)
        for (unsigned int z = y + 1; z < count; z++)
        {
            // {nums[x], nums[y], nums[z]} is an answer
        }
Let us treat 100 as a variable and call it k, and let m be the number of distinct values actually present in the array (so m <= k).
With the first method, you have O(n) preparation and O(m^2 * k) work to generate the triplets, which is quite fast.
With the second method, you have O(n*m) preparation and O(m^3) work to generate the triplets. Given your values of n and m, the preparation takes too long.
You could actually merge the two methods to get the best of both worlds, so something like this:
int nums[100];          // from 0 to count are the numbers you have
int count = 0;
bool exists[100] = {0}; // exists[i] means whether i exists in your vector

for (unsigned int i = 0, size = vect.size(); i < size; ++i)
{
    if (!exists[vect[i]])
        nums[count++] = vect[i];
    exists[vect[i]] = true;
}
Then:
for (unsigned int x = 0; x < count - 2; x++)
    for (unsigned int y = x + 1; y < count - 1; y++)
        for (unsigned int z = y + 1; z < count; z++)
        {
            // {nums[x], nums[y], nums[z]} is an answer
        }
This method has O(n) preparation and O(m^3) cost to find the unique triplets.
Edit: It turned out that for the OP, the same number in different locations is considered a different value. If that is really the case, then I'm sorry, there is no faster solution. The reason is that the number of possible combinations is C(n, 3) (a binomial coefficient); even though you can generate each one in O(1), the total count is simply too big.
There's really nothing that can be done to speed up the loop body you have there. Consider that with a vector of 1,000,000 elements, the three nested loops perform on the order of C(1,000,000, 3), roughly 1.7 x 10^17, iterations.
Producing all combinations like that is a combinatorial-explosion problem, which means you won't be able to solve it practically once the input gets large enough. Your only option is to leverage specific knowledge of your application (what you need the results for, and how exactly they will be used) to work around the issue, if possible.
Possibly you can sort your input, make it unique, and pick x[a], x[b] and x[c] with a < b < c. The sort is O(n log n), and enumerating the combinations is cubic in the number of unique values (at most 100 here). Still, you will have far fewer triplets to iterate over:
std::vector<int> x = original_vector;
std::sort(x.begin(), x.end());
x.erase(std::unique(x.begin(), x.end()), x.end());

for (size_t a = 0; a + 2 < x.size(); ++a)
    for (size_t b = a + 1; b + 1 < x.size(); ++b)
        for (size_t c = b + 1; c < x.size(); ++c)
            issue_triplet(x[a], x[b], x[c]); // pseudocode: handle one triplet
Depending on your actual data, you may be able to speed it up significantly by first making a vector that has at most three entries with each value and iterate over that instead.
As r15habh pointed out, I think the fact that the values in the array are between 1-100 is in fact important.
Here's what you can do: make one pass through the array, reading values into a unique set. This by itself is O(n) time complexity. The set will have no more than 100 elements, which means O(1) space complexity.
Now since you need to generate all 3-item combinations, you'll still need 3 nested loops, but instead of operating on the potentially huge array, you'll be operating on a set that has at most 100 elements.
Overall time complexity depends on your original data set: for a small data set the triple loop dominates and it behaves like O(n^3), while for a large data set the single O(n) pass dominates, because the triple loop is bounded by the at most 100 distinct values.
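A minimal sketch of this approach, using std::set for the unique pass (the function name is mine, and the loop body is left empty, as in the question):
#include <cstddef>
#include <set>
#include <vector>

// One O(n) pass collects the distinct values (at most 100), then a triple
// loop enumerates the value-unique triples.
void uniqueValueTriples(const std::vector<int>& vect) {
    std::set<int> uniq(vect.begin(), vect.end());
    std::vector<int> vals(uniq.begin(), uniq.end());
    for (std::size_t x = 0; x + 2 < vals.size(); ++x)
        for (std::size_t y = x + 1; y + 1 < vals.size(); ++y)
            for (std::size_t z = y + 1; z < vals.size(); ++z) {
                // {vals[x], vals[y], vals[z]} is one triple of distinct values
            }
}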
If I understand your application correctly, you can use a tuple instead, and store it in either a set or a hash table depending on your requirements. If the orientation (normal) of the triple matters, rotate it so that, say, the largest element comes first; if the orientation shouldn't matter, just sort the tuple. A version using Boost and integers:
#include <iostream>
#include <set>
#include <algorithm>
#include "boost/tuple/tuple.hpp"
#include "boost/tuple/tuple_comparison.hpp"

int main()
{
    typedef boost::tuple<int, int, int> Tri;
    typedef std::set<Tri> TriSet;
    TriSet storage;

    // 1 duplicate
    int exampleData[4][3] = { { 1, 2, 3 }, { 2, 3, 6 }, { 5, 3, 2 }, { 2, 1, 3 } };
    for (unsigned int i = 0; i < sizeof(exampleData) / sizeof(exampleData[0]); ++i)
    {
        std::sort(exampleData[i], exampleData[i] + (sizeof(exampleData[i]) / sizeof(exampleData[i][0])));
        if (!storage.insert(boost::make_tuple(exampleData[i][0], exampleData[i][1], exampleData[i][2])).second)
            std::cout << "Duplicate!" << std::endl;
        else
            std::cout << "Not duplicate!" << std::endl;
    }
}
I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
right now I'm doing this:
struct SourcePoint {
    Point point;
    float value;
};

SourcePoint* maxValues = new SourcePoint[M];

for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
        if (sample > maxValues[0].value) {
            int q = 1;
            while (q < M && sample > maxValues[q].value) { // check q first to avoid reading past the end
                maxValues[q - 1] = maxValues[q]; // shuffle the values back
                q++;
            }
            maxValues[q - 1].value = sample;
            maxValues[q - 1].point = Point(i, j);
        }
    }
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encountered so far. This gives us a quick and easy bailout: if sample <= maxValues[0].value, we don't do anything. The issue I'm having is the shuffling every time a new, better value is found. It works its way all the way down maxValues until it finds its spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm since this is called many many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very small amount (not as much as I thought it would).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Looking at how the data is created/stored/organized may reveal patterns or information that reduce the time needed to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
    int x;
    int y;
    float value;
    int test; // Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, a simple first step would be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to a power-of-two size, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
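A possible sketch of the packing idea, assuming the grid side fits in 16 bits (N is 10-30 in the question, so it does comfortably); the struct and helper names are mine:
#include <cstdint>

// Pack (x, y) into one 32-bit integer: (y << 16) | x.
struct PackedSourcePoint {
    std::uint32_t xy;   // packed coordinates
    float value;        // 8 bytes total: power-of-two size, simple indexing
};

inline std::uint32_t packXY(int x, int y) {
    return (static_cast<std::uint32_t>(y) << 16) | static_cast<std::uint32_t>(x);
}

inline int unpackX(std::uint32_t xy) { return static_cast<int>(xy & 0xFFFFu); }
inline int unpackY(std::uint32_t xy) { return static_cast<int>(xy >> 16); }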
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think the OP's insertion sort probably has decent real-world performance (at least when the recommendations of the other posters in this thread are taken into account).
Look-up in the case of failure should be constant time: If the current element is less than the minimum element of the heap (containing the max M elements) we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
    T nodes[size];
    static const unsigned last = size - 1;

    static unsigned parent(unsigned i) { return (i - 1) / 2; }
    static unsigned left(unsigned i)   { return i * 2 + 1; }
    static unsigned right(unsigned i)  { return i * 2 + 2; }

    void bubble_down(unsigned i) {
        for (;;) {
            unsigned j = i;
            if (left(i) < size && nodes[left(i)] < nodes[i])
                j = left(i);
            if (right(i) < size && nodes[right(i)] < nodes[j])
                j = right(i);
            if (i != j) {
                std::swap(nodes[i], nodes[j]);
                i = j;
            } else {
                break;
            }
        }
    }

    void bubble_up(unsigned i) {
        while (i > 0 && nodes[i] < nodes[parent(i)]) {
            std::swap(nodes[parent(i)], nodes[i]);
            i = parent(i);
        }
    }

public:
    m_heap() {
        for (unsigned i = 0; i < size; i++) {
            nodes[i] = std::numeric_limits<T>::lowest(); // min() would be wrong for floating point
        }
    }

    void add(const T& x) {
        if (x < nodes[0]) {
            // reject outright: smaller than the Mth largest seen so far
            return;
        }
        nodes[0] = x;   // replace the current minimum
        bubble_down(0); // restore the min-heap property
    }

    T* get() { return nodes; }
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

using namespace std;

// INCLUDE TEMPLATED CLASS FROM ABOVE

typedef vector<float> vf;

bool compare(float a, float b) { return a > b; }

int main()
{
    int N = 2000;
    vf v;
    for (int i = 0; i < N; i++) v.push_back(rand() * 1e6 / RAND_MAX);

    static const int M = 50;
    m_heap<M, float> h;
    for (int i = 0; i < N; i++) h.add(v[i]);

    sort(v.begin(), v.end(), compare);
    vf heap(h.get(), h.get() + M); // m_heap::get() exposes the underlying array
    sort(heap.begin(), heap.end(), compare);

    cout << "Real\tFake" << endl;
    for (int i = 0; i < M; i++) {
        cout << v[i] << "\t" << heap[i] << endl;
        if (fabs(v[i] - heap[i]) > 1e-5) abort();
    }
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
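For instance, here is a hedged sketch of a bounded top-M selection with std::priority_queue used as a min-heap; the function name, the packed index, and the flat row-major array are my own assumptions, not part of the original code.
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Keep the M largest (value, position) pairs in a min-heap: the heap front is
// the smallest of the current top M, so most samples cost one comparison.
std::vector<std::pair<float, int>> topM(const std::vector<float>& arr, int N, int M) {
    using Entry = std::pair<float, int>;                          // (value, y * N + x)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int y = 0; y < N; ++y) {
        for (int x = 0; x < N; ++x) {
            float v = arr[static_cast<std::size_t>(y) * N + x];
            if (static_cast<int>(heap.size()) < M) {
                heap.push({v, y * N + x});
            } else if (v > heap.top().first) {
                heap.pop();                                       // drop the current minimum
                heap.push({v, y * N + x});
            }
        }
    }
    std::vector<Entry> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;                                                // ascending by value
}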
A quick optimization would be to add a sentinel value to your maxValues array. If you have maxValues[M].value equal to std::numeric_limits<float>::max() then you can eliminate the q < M test in your while loop condition.
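A rough sketch of the sentinel idea, with local stand-ins for the question's Point/SourcePoint types and an assumed M:
#include <limits>

// One extra slot holding +infinity lets the shuffle loop drop its q < M test.
struct Point { int x, y; };
struct SourcePoint { Point point; float value; };

static const int M = 15;            // assumed top-M size
static SourcePoint maxValues[M + 1];

void initSentinel() {
    for (int q = 0; q < M; ++q)
        maxValues[q].value = -std::numeric_limits<float>::max(); // "empty" slots
    maxValues[M].value = std::numeric_limits<float>::max();      // sentinel
}

void insert(float sample, int i, int j) {
    if (sample <= maxValues[0].value) return;  // quick reject
    int q = 1;
    while (sample > maxValues[q].value) {      // sentinel guarantees q <= M
        maxValues[q - 1] = maxValues[q];       // shuffle values back
        ++q;
    }
    maxValues[q - 1].value = sample;
    maxValues[q - 1].point = Point{i, j};
}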
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try: if it works well enough, you don't have as much "magic". In particular, you don't resort to micro-optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
#include <string.h>
static const int M = 15;
static const int N = 20;
// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
    Sample(float *arr, size_t row, size_t col)
        : m_arr(arr),
          m_row(row),
          m_col(col)
    {
    }

    inline operator float() const {
        return m_arr[m_row * N + m_col];
    }

    bool operator<(const Sample &rhs) const {
        // reversed on purpose, so that partial_sort puts the largest samples first
        return (float)rhs < (float)*this;
    }

    int row() const {
        return m_row;
    }

    int col() const {
        return m_col;
    }

private:
    float *m_arr;
    size_t m_row;
    size_t m_col;
};
int main()
{
    // Setup a demo array
    float arr[N][N];
    memset(arr, 0, sizeof(arr));

    // Put in some sample values
    arr[2][1] = 5.0;
    arr[9][11] = 2.0;
    arr[5][4] = 4.0;
    arr[15][7] = 3.0;
    arr[12][19] = 1.0;

    // Setup the sequence of references into this array; you could keep
    // a copy of this sequence around to reuse it later, I think.
    std::vector<Sample> samples;
    samples.reserve(N * N);
    for (size_t row = 0; row < N; ++row) {
        for (size_t col = 0; col < N; ++col) {
            samples.push_back(Sample((float *)arr, row, col));
        }
    }

    // Let partial_sort find the M largest entries
    std::partial_sort(samples.begin(), samples.begin() + M, samples.end());

    // Print out the row/column of the M largest entries.
    for (std::vector<Sample>::size_type i = 0; i < M; ++i) {
        std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at "
                  << samples[i].row() << "/" << samples[i].col() << std::endl;
    }
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
    for (int j = 0; j < rows; j++) {
        float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
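A small sketch of that suggestion (function and parameter names are mine, not from the original code): keep `best` as a min-heap of the M largest samples seen so far, maintained with push_heap/pop_heap and std::greater.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// `best` is a min-heap of the M largest samples seen so far (std::greater
// puts the smallest at the front), so a worse sample is rejected with a
// single comparison.
void addSample(std::vector<float>& best, std::size_t M, float sample) {
    if (best.size() < M) {
        best.push_back(sample);
        std::push_heap(best.begin(), best.end(), std::greater<float>());
    } else if (sample > best.front()) {
        std::pop_heap(best.begin(), best.end(), std::greater<float>()); // move current min to the back
        best.back() = sample;                                           // overwrite it
        std::push_heap(best.begin(), best.end(), std::greater<float>());
    }
}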
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.
Use a linked list to store the current best M values. You'll still have to iterate over it to find the right spot, but the insertion itself is O(1). It would probably even beat binary search plus insertion into an array: O(M) search + O(1) insert versus O(lg M) search + O(M) insert.
Interchange the for loops, so you're not striding through memory N elements at a time and thrashing the cache.
LE (later edit): here is another idea that might work for uniformly distributed values; a rough sketch follows the complexity estimate below.
Find the min, max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix, place it in bucket[(int)((value - min) / range * bucket_count)], where range = max - min.
Finally create a set starting from the highest bucket to the lowest, add elements from other buckets to it while |current set| + |next bucket| <=M.
If you get M elements you're done.
You'll likely get fewer elements than M, let's say P.
Apply your original algorithm to the remaining bucket and take the biggest M-P elements out of it.
If the elements are uniform and you use N^2 buckets, its complexity is about 3.5*N^2, versus your current solution which is about O(N^2)*ln(M).
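A rough sketch of the bucket approach (values only; positions are omitted for brevity, the input is assumed non-empty, and the bucket count is simply set to the number of elements):
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Bucket-select the M largest values of a flattened N x N matrix `a`.
std::vector<float> topMByBuckets(const std::vector<float>& a, std::size_t M) {
    auto mm = std::minmax_element(a.begin(), a.end());
    const float lo = *mm.first, hi = *mm.second, range = hi - lo;
    const std::size_t nbuckets = a.size();                 // ~N^2 buckets
    std::vector<std::vector<float>> buckets(nbuckets);
    for (float v : a) {
        std::size_t b = range > 0
            ? std::min<std::size_t>(nbuckets - 1,
                  static_cast<std::size_t>((v - lo) / range * nbuckets))
            : 0;
        buckets[b].push_back(v);
    }
    std::vector<float> result;
    for (std::size_t b = nbuckets; b-- > 0 && result.size() < M; ) {
        std::vector<float>& bucket = buckets[b];
        if (result.size() + bucket.size() <= M) {
            result.insert(result.end(), bucket.begin(), bucket.end());
        } else {
            // Partially used bucket: fall back to selecting its largest entries.
            std::size_t need = M - result.size();
            std::partial_sort(bucket.begin(), bucket.begin() + need, bucket.end(),
                              std::greater<float>());
            result.insert(result.end(), bucket.begin(), bucket.begin() + need);
        }
    }
    return result; // the M largest values (fewer if the matrix has fewer than M)
}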