calculate prefix sum - c++

I have following code to accomplish prefix sum task:
#include <iostream>
#include<math.h>
using namespace std;
int Log(int n){
int count=1;
while (n!=0){
n>>=1;
count++;
}
return count;
}
int main(){
int x[16]={39,21,20,50,13,18,2,33,49,39,47,15,30,47,24,1};
int n=sizeof(x)/sizeof(int );
for (int i=0;i<=(Log(n)-1);i++){
for (int j=0;j<=n-1;j++){
if (j>=(std::powf(2,i))){
int t=powf(2,i);
x[j]=x[j]+x[j-t];
}
}
}
for (int i=0;i<n;i++)
cout<<x[i]<< " ";
return 0;
}
From this wikipedia page
but i have got wrong result what is wrong? please help

I’m not sure what your code is supposed to do but implementing a prefix sum is actually pretty easy. For example, this calculates the (exclusive) prefix sum of an iterator range using an arbitrary operation:
template <typename It, typename F, typename T>
inline void prescan(It front, It back, F op, T const& id) {
if (front == back) return;
typename iterator_traits<It>::value_type accu = *front;
*front++ = id;
for (; front != back; ++front) {
swap(*front, accu);
accu = op(accu, *front);
}
}
This follows the interface style of the C++ standard library algorithms.
To use it from your code, you could write the following:
prescan(x, x + n, std::plus<int>());
Are you trying to implement a parallel prefix sum? This only makes sense when you actually parallelize your code. As it stands, your code is executed sequentially and doesn’t gain anything from the more complex divide and conquer logic that you seem to employ.
Furthermore, there are indeed errors in your code. The most important one:
for(int i=0;i<=(Log(n)-1);i++)
Here, you’re iterating until floor(log(n)) - 1. But the pseudo-code states that you in fact need to iterate until ceil(log(n)) - 1.
Furthermore, consider this:
for (int j=0;j<=n-1;j++)
This isn’t very usual. Normally, you’d write such code as follows:
for (int j = 0; j < n; ++j)
Notice that I used < instead of <= and adjusted the bounds from j - 1 to j. The same would hold for the outer loop, so you’d get:
for (int i = 0; i < std::log(n); ++i)
Finally, instead of std::powf, you can use the fact that pow(2, x) is the same as 1 << x (i.e. taking advantage of the fact that operations base 2 are efficiently implemented as bit operations). This means that you can simply write:
if (j >= 1 << i)
x[j] += x[j - (1 << i)];

I think that std::partial_sum does what you want
http://www.sgi.com/tech/stl/partial_sum.html

The quickest way to get your algorithm working: Drop the outer for(i...) loop, instead setting i to 0, and use only the inner for (j...) loop.
int main(){
...
int i=0;
for (int j=0;j<=n-1;j++){
if (j>=(powf(2,i))){
int t=powf(2,i);
x[j]=x[j]+x[j-t];
}
}
...
}
Or equivalently:
for (int j=0; j<=n-1; j++) {
if (j>=1)
x[j] = x[j] + x[j-1];
}
...which is the intuitive way to do a prefix sum, and also probably the fastest non-parallel algorithm.
Wikipedia's algorithm is designed to be run in parallel, such that all of the additions are completely independent of each other. It reads all the values in, adds to them, then writes them back into the array, all in parallel. In your version, when you execute x[j]=x[j]+x[j-t], you're using the x[j-t] that you just added to, t iterations ago.
If you really want to reproduce Wikipedia's algorithm, here's one way, but be warned it will be much slower than the intuitive way above, unless you are using a parallelizing compiler and a computer with a whole bunch of processors.
int main() {
int x[16]={39,21,20,50,13,18,2,33,49,39,47,15,30,47,24,1};
int y[16];
int n=sizeof(x)/sizeof(int);
for (int i=0;i<=(Log(n)-1);i++){
for (int j=0;j<=n-1;j++){
y[j] = x[j];
if (j>=(powf(2,i))){
int t=powf(2,i);
y[j] += x[j-t];
}
}
for (int j=0;j<=n-1;j++){
x[j] = y[j];
}
}
for (int i=0;i<n;i++)
cout<<x[i]<< " ";
cout<<endl;
}
Side notes: You can use 1<<i instead of powf(2,i), for speed. And as ergosys mentioned, your Log() function needs work; the values it returns are too high, which won't affect the partial sum's result in this case, but will make it take longer.

Related

What would be a better way to write a loop that loops by initial value

int square(int x) {
int looptime = x;
int total = 0;
for (int i=0 ; i < looptime; ++i) {
total += looptime;
}
return total;
}
int main()
{
for (int i = 0; i < 100; ++i)
cout << i << '\t' << square(i) << '\n';
}
I am new to C++ and trying to learn by self study by reading "Programming Principles and Practices". In this particular problem, I have to make a "square()" function. The catch is I have to use addition,no multiplication.
The above code works and returns correct values, but contrary to mantra of readability, I find it hard to read and understand and I'm the one that wrote it.
I need to pass into the loop, the initial integer and have it loop that many times without effecting the original amount. Is there a better way to write this.
There is no need for looptime in the function.
int square(int x) {
int total = 0;
for (int i=0 ; i < x; ++i) {
total += x;
}
return total;
}
In regards to why it was infinitely doubling, it is difficult without seeing the earlier code. The logical explanation would be that the logic condition in the for loop was flawed, so it never exited when expected.
In this portion, the logic likely would have been where i < looptime is now:
for (int i=0 ; i < looptime; ++i) {
total += looptime;
If that condition is always met, it would explain the symptom (it will remain in the loop, adding to it each cycle). If you can reproduce the earlier result and show the earlier code it would allow this theory to be proven out.

What is wrong with the logic in my program?

I've been tasked to create a function that identifies the number of occurrences in an array, however i am not getting the correct result. This is the function i wrote, i left out the rest of the program as that works.
int countOccurences(int b[], int size, int x)
{
int occ = x;
for(int i = 0; i < size; i++)
{
if(b[i] == occ)
occ++;
}
cout << occ << endl;
return occ;
}
If occ is meant to be the number of occurrences, it should be initialised to zero rather than x.
And the comparison should be between b[i] and x, not b[i] and occ.
And, as an aside (not affecting your actual logic), it's also very unusual to actually print out the return value in a utility function which is obviously meant to simply return the count but it may be you have that in there just for debug purposes.
And you should both ensure your indentation and use of braces is consistent between your for and your if - it will make your code easier to maintain.
That's all totally aside from the fact that C++ possesses a std::count() method in <algorithm> that will work this out for you without having to write a function to do it (although it may be that this is an educational question and the intent is to learn how to code things like this, rather than use readily made library functions to do the heavy lifting for you).
int countOccurences(int b[], const unsigned int size, const int x)
{
int occ = 0;
for(unsigned int i = 0; i < size; i++)
{
if(b[i] == x)
{
occ++;
}
}
std::cout << occ << std::endl;
return occ;
}
occ should start at zero
You should compare b[i] to x
Array indices should be unsigned
Why not be const-correct?
using namespace std; is bad practice

Trouble sieving primes from a large range

#include <cstdio>
#include <algorithm>
#include <cmath>
using namespace std;
int main() {
int t,m,n;
scanf("%d",&t);
while(t--)
{
scanf("%d %d",&m,&n);
int rootn=sqrt(double(n));
bool p[10000]; //finding prime numbers from 1 to square_root(n)
for(int j=0;j<=rootn;j++)
p[j]=true;
p[0]=false;
p[1]=false;
int i=rootn;
while(i--)
{
if(p[i]==true)
{
int c=i;
do
{
c=c+i;
p[c]=false;
}while(c+p[i]<=rootn);
}
};
i=0;
bool rangep[10000]; //used for finding prime numbers between m and n by eliminating multiple of primes in between 1 and squareroot(n)
for(int j=0;j<=n-m+1;j++)
rangep[j]=true;
i=rootn;
do
{
if(p[i]==true)
{
for(int j=m;j<=n;j++)
{
if(j%i==0&&j!=i)
rangep[j-m]=false;
}
}
}while(i--);
i=n-m;
do
{
if(rangep[i]==true)
printf("%d\n",i+m);
}while(i--);
printf("\n");
}
return 0;
system("PAUSE");
}
Hello I'm trying to use the sieve of Eratosthenes to find prime numbers in a range between m to n where m>=1 and n<=100000000. When I give input of 1 to 10000, the result is correct. But for a wider range, the stack is overflowed even if I increase the array sizes.
               
A simple and more readable implementation
void Sieve(int n) {
int sqrtn = (int)sqrt((double)n);
std::vector<bool> sieve(n + 1, false);
for (int m = 2; m <= sqrtn; ++m) {
if (!sieve[m]) {
cout << m << " ";
for (int k = m * m; k <= n; k += m)
sieve[k] = true;
}
}
for (int m = sqrtn; m <= n; ++m)
if (!sieve[m])
cout << m << " ";
}
Reason of getting error
You are declaring an enormous array as a local variable. That's why when the stack frame of main is pushed it needs so much memory that stack overflow exception is generated. Visual studio is tricky enough to analyze the code for projected run-time stack usage and generate exception when needed.
Use this compact implementation. Moreover you can have bs declared in the function if you want. Don't make implementations complex.
Implementation
typedef long long ll;
typedef vector<int> vi;
vi primes;
bitset<100000000> bs;
void sieve(ll upperbound) {
_sieve_size = upperbound + 1;
bs.set();
bs[0] = bs[1] = 0;
for (ll i = 2; i <= _sieve_size; i++)
if (bs[i]) { //if not marked
for (ll j = i * i; j <= _sieve_size; j += i) //check all the multiples
bs[j] = 0; // they are surely not prime :-)
primes.push_back((int)i); // this is prime
} }
call from main() sieve(10000);. You have primes list in vector primes.
Note: As mentioned in comment--stackoverflow is quite unexpected error here. You are implementing sieve but it will be more efficient if you use bistet instead of bool.
Few things like if n=10^8 then sqrt(n)=10^4. And your bool array is p[10000]. So there is a chance of accessing array out of bound.
I agree with the other answers,
saying that you should basically just start over. 
Do you even care why your code doesn’t work?  (You didn’t actually ask.)
I’m not sure that the problem in your code
has been identified accurately yet. 
First of all, I’ll add this comment to help set the context:
// For any int aardvark;
// p[aardvark] = false means that aardvark is composite (i.e., not prime).
// p[aardvark] = true means that aardvark might be prime, or maybe we just don’t know yet.
Now let me draw your attention to this code:
int i=rootn;
while(i--)
{
if(p[i]==true)
{
int c=i;
do
{
c=c+i;
p[c]=false;
}while(c+p[i]<=rootn);
}
};
You say that n≤100000000 (although your code doesn’t check that), so,
presumably, rootn≤10000, which is the dimensionality (size) of p[]. 
The above code is saying that, for every integer i
(no matter whether it’s prime or composite),
2×i, 3×i, 4×i, etc., are, by definition, composite. 
So, for c equal to 2×i, 3×i, 4×i, …,
we set p[c]=false because we know that c is composite.
But look closely at the code. 
It sets c=c+i and says p[c]=false
before checking whether c is still in range
to be a valid index into p[]. 
Now, if n≤25000000, then rootn≤5000. 
If i≤ rootn, then i≤5000, and, as long as c≤5000, then c+i≤10000. 
But, if n>25000000, then rootn>5000,†
and the sequence i=rootn;, c=i;, c=c+i;
can set c to a value greater than 10000. 
And then you use that value to index into p[]. 
That’s probably where the stack overflow occurs.
Oh, BTW; you don’t need to say if(p[i]==true); if(p[i]) is good enough.
To add insult to injury, there’s a second error in the same block:
while(c+p[i]<=rootn). 
c and i are ints,
and p is an array of bools, so p[i] is a bool —
and yet you are adding c + p[i]. 
We know from the if that p[i] is true,
which is numerically equal to 1 —
so your loop termination condition is while (c+1<=rootn);
i.e., while c≤rootn-1. 
I think you meant to say while(c+i<=rootn).
Oh, also, why do you have executable code
immediately after an unconditional return statement? 
The system("PAUSE"); statement cannot possibly be reached.
(I’m not saying that those are the only errors;
they are just what jumped out at me.)
______________
† OK, splitting hairs, n has to be ≥ 25010001
(i.e., 50012) before rootn>5000.

Need advice on improving my code: Search Algorithm

I'm pretty new at C++ and would need some advice on this.
Here I have a code that I wrote to measure the number of times an arbitrary integer x occurs in an array and to output the comparisons made.
However I've read that by using multi-way branching("Divide and conqurer!") techniques, I could make the algorithm run faster.
Could anyone point me in the right direction how should I go about doing it?
Here is my working code for the other method I did:
#include <iostream>
#include <cstdlib>
#include <vector>
using namespace std;
vector <int> integers;
int function(int vectorsize, int count);
int x;
double input;
int main()
{
cout<<"Enter 20 integers"<<endl;
cout<<"Type 0.5 to end"<<endl;
while(true)
{
cin>>input;
if (input == 0.5)
break;
integers.push_back(input);
}
cout<<"Enter the integer x"<<endl;
cin>>x;
function((integers.size()-1),0);
system("pause");
}
int function(int vectorsize, int count)
{
if(vectorsize<0) //termination condition
{
cout<<"The number of times"<< x <<"appears is "<<count<<endl;
return 0;
}
if (integers[vectorsize] > x)
{
cout<< integers[vectorsize] << " > " << x <<endl;
}
if (integers[vectorsize] < x)
{
cout<< integers[vectorsize] << " < " << x <<endl;
}
if (integers[vectorsize] == x)
{
cout<< integers[vectorsize] << " = " << x <<endl;
count = count+1;
}
return (function(vectorsize-1,count));
}
Thanks!
If the array is unsorted, just use a single loop to compare each element to x. Unless there's something you're forgetting to tell us, I don't see any need for anything more complicated.
If the array is sorted, there are algorithms (e.g. binary search) that would have better asymptotic complexity. However, for a 20-element array a simple linear search should still be the preferred strategy.
If your array is a sorted one you can use a divide to conquer strategy:
Efficient way to count occurrences of a key in a sorted array
A divide and conquer algorithm is only beneficial if you can either eliminate some work with it, or if you can parallelize the divided work parts accross several computation units. In your case, the first option is possible with an already sorted dataset, other answers may have addressed the problem.
For the second solution, the algorithm name is map reduce, which split the dataset in several subsets, distribute the subsets to as many threads or processes, and gather the results to "compile" them (the term is actually "reduce") in a meaningful result. In your setting, it means that each thread will scan its own slice of the array to count the items, and return its result to the "reduce" thread, which will add them up to return the final result. This solution is only interesting for large datasets though.
There are questions dealing with mapreduce and c++ on SO, but I'll try to give you a sample implementation here:
#include <utility>
#include <thread>
#include <boost/barrier>
constexpr int MAP_COUNT = 4;
int mresults[MAP_COUNT];
boost::barrier endmap(MAP_COUNT + 1);
void mfunction(int start, int end, int rank ){
int count = 0;
for (int i= start; i < end; i++)
if ( integers[i] == x) count++;
mresult[rank] = count;
endmap.wait();
}
int rfunction(){
int count = 0;
for (int i : mresults) {
count += i;
}
return count;
}
int mapreduce(){
vector<thread &> mthreads;
int range = integers.size() / MAP_COUNT;
for (int i = 0; i < MAP_COUNT; i++ )
mthreads.push_back(thread(bind(mfunction, i * range, (i+1) * range, i)));
endmap.wait();
return rfunction();
}
Once the integers vector has been populated, you call the mapreduce function defined above, which should return the expected result. As you can see, the implementation is very specialized:
the map and reduce functions are specific to your problem,
the number of threads used for map is static,
I followed your style and used global variables,
for convenience, I used a boost::barrier for synchronization
However this should give you an idea of the algorithm, and how you could apply it to similar problems.
caveat: code untested.

Optimized way to find M largest elements in an NxN array using C++

I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
right now I'm doing this:
struct SourcePoint {
Point point;
float value;
}
SourcePoint* maxValues = new SourcePoint[ M ];
maxCoefficients = new SourcePoint*[
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
if (sample > maxValues[0].value) {
int q = 1;
while ( sample > maxValues[q].value && q < M ) {
maxValues[q-1] = maxValues[q]; // shuffle the values back
q++;
}
maxValues[q-1].value = sample;
maxValues[q-1].point = Point(i,j);
}
}
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encoutered so far. This gives us a quick and easy bailout if sample <= maxValues, we don't do anything. The issue I'm having is the shuffling every time a new better value is found. It works its way all the way down maxValues until it finds it's spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm since this is called many many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very little bit (not as much as I'd thought there would be).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Look at how the data is created/stored/organized may give you patterns or information to reduce the amount of time to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
int x;
int y;
float value;
int test; //Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, the a simple first step should be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to being a power of two long, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think OPs insertion sort probably has decent real world performance (at least when the recommendations of the other posteres in this thread are taken into account).
Look-up in the case of failure should be constant time: If the current element is less than the minimum element of the heap (containing the max M elements) we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
T nodes[size];
static const unsigned last = size - 1;
static unsigned parent(unsigned i) { return (i - 1) / 2; }
static unsigned left(unsigned i) { return i * 2; }
static unsigned right(unsigned i) { return i * 2 + 1; }
void bubble_down(unsigned int i) {
for (;;) {
unsigned j = i;
if (left(i) < size && nodes[left(i)] < nodes[i])
j = left(i);
if (right(i) < size && nodes[right(i)] < nodes[j])
j = right(i);
if (i != j) {
swap(nodes[i], nodes[j]);
i = j;
} else {
break;
}
}
}
void bubble_up(unsigned i) {
while (i > 0 && nodes[i] < nodes[parent(i)]) {
swap(nodes[parent(i)], nodes[i]);
i = parent(i);
}
}
public:
m_heap() {
for (unsigned i = 0; i < size; i++) {
nodes[i] = numeric_limits<T>::min();
}
}
void add(const T& x) {
if (x < nodes[0]) {
// reject outright
return;
}
nodes[0] = x;
swap(nodes[0], nodes[last]);
bubble_down(0);
}
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
using namespace std;
// INCLUDE TEMPLATED CLASS FROM ABOVE
typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }
int main()
{
int N = 2000;
vf v;
for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX);
static const int M = 50;
m_heap<M, float> h;
for (int i = 0; i < N; i++) h.add( v[i] );
sort(v.begin(), v.end(), compare);
vf heap(h.get(), h.get() + M); // assume public in m_heap: T* get() { return nodes; }
sort(heap.begin(), heap.end(), compare);
cout << "Real\tFake" << endl;
for (int i = 0; i < M; i++) {
cout << v[i] << "\t" << heap[i] << endl;
if (fabs(v[i] - heap[i]) > 1e-5) abort();
}
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
A quick optimization would be to add a sentinel value to yourmaxValues array. If you have maxValues[M].value equal to std::numeric_limits<float>::max() then you can eliminate the q < M test in your while loop condition.
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try - if it works good enough, you don't have as much "magic". In particular, you don't resort to micro optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
static const int M = 15;
static const int N = 20;
// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
Sample( float *arr, size_t row, size_t col )
: m_arr( arr ),
m_row( row ),
m_col( col )
{
}
inline operator float() const {
return m_arr[m_row * N + m_col];
}
bool operator<( const Sample &rhs ) const {
return (float)other < (float)*this;
}
int row() const {
return m_row;
}
int col() const {
return m_col;
}
private:
float *m_arr;
size_t m_row;
size_t m_col;
};
int main()
{
// Setup a demo array
float arr[N][N];
memset( arr, 0, sizeof( arr ) );
// Put in some sample values
arr[2][1] = 5.0;
arr[9][11] = 2.0;
arr[5][4] = 4.0;
arr[15][7] = 3.0;
arr[12][19] = 1.0;
// Setup the sequence of references into this array; you could keep
// a copy of this sequence around to reuse it later, I think.
std::vector<Sample> samples;
samples.reserve( N * N );
for ( size_t row = 0; row < N; ++row ) {
for ( size_t col = 0; col < N; ++col ) {
samples.push_back( Sample( (float *)arr, row, col ) );
}
}
// Let partial_sort find the M largest entry
std::partial_sort( samples.begin(), samples.begin() + M, samples.end() );
// Print out the row/column of the M largest entries.
for ( std::vector<Sample>::size_type i = 0; i < M; ++i ) {
std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
}
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
for (int j = 0; j < rows; j++) {
float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.
Use a linked list to store the best yet M values. You'll still have to iterate over it to find the right spot, but the insertion is O(1). It would probably even be better than binary search and insertion O(N)+O(1) vs O(lg(n))+O(N).
Interchange the fors, so you're not accessing every N element in memory and trashing the cache.
LE: Throwing another idea that might work for uniformly distributed values.
Find the min, max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix place it in bucket[(int)(value-min)/range], range=max-min.
Finally create a set starting from the highest bucket to the lowest, add elements from other buckets to it while |current set| + |next bucket| <=M.
If you get M elements you're done.
You'll likely get less elements than M, let's say P.
Apply your algorithm for the remaining bucket and get biggest M-P elements out of it.
If elements are uniform and you use N^2 buckets it's complexity is about 3.5*(N^2) vs your current solution which is about O(N^2)*ln(M).