I'm looking for a program that finds the nth number of the Golomb sequence without using an array.
I know the program below, but it is very slow:
#include <bits/stdc++.h>
using namespace std;

int findGolomb(int);

int main()
{
    int n;
    cin >> n;
    cout << findGolomb(n);
    return 0;
}

int findGolomb(int n)
{
    if (n == 1)
        return 1;
    else
        return 1 + findGolomb(n - findGolomb(findGolomb(n - 1)));
}
It depends on how large a value you want to calculate. For n <= 50000, the following works:
#include <cmath>

round(1.201 * pow(n, 0.618));
As it turns out, due to the nature of this sequence, you need almost every single entry in it to compute g[n]. I coded up a solution that uses a map to save past calculations, purging it of values that are no longer needed. For n == 500000, the map still held roughly 496,000 entries, and since a map stores a key and a value where the array would store only a value, you end up using about twice as much memory.
#include <iostream>
#include <cstdlib>
#include <map>
using namespace std;

class Golomb_Generator {
public:
    int next() {
        if (n == 1)
            return cache[n++] = 1;
        int firstTerm = n - 1;
        int secondTerm = cache[firstTerm];
        int thirdTerm = n - cache[secondTerm];
        if (n != 3) {
            auto itr = cache.upper_bound(secondTerm - 1);
            cache.erase(begin(cache), itr);
        }
        return cache[n++] = 1 + cache[thirdTerm];
    }
    void printCacheSize() {
        cout << cache.size() << endl;
    }
private:
    int n = 1;
    map<int, int> cache;
};

void printGolomb(long long n)
{
    Golomb_Generator g{};
    for (int i = 0; i < n - 1; ++i)
        g.next();
    cout << g.next() << endl;
    g.printCacheSize();
}

int main()
{
    int n = 500000;
    printGolomb(n);
    return EXIT_SUCCESS;
}
You can guess as much: n - g(g(n - 1)) uses g(n - 1) as an argument to g, which is always much, much smaller than n. At the same time, the recurrence also uses n - 1 as an argument, which is close to n. You can't delete that many entries.
About the best you can do without O(n) memory is recursion combined with the approximation that is accurate for smaller n, but it will still become slow quickly. Additionally, as the recursive calls stack up, you will likely use more memory than having an appropriately sized array would.
You might be able to do a little better, though. The sequence grows very slowly. Applying that fact to g(n - g(g(n - 1))), you can convince yourself that this relationship mostly needs stored values near 1 and stored values near n -- nearN(n - near1(nearN(n - 1))). There is a tremendous swath in between that does not need to be stored, because those values would only be used to calculate g(n) for much, much larger n than you care about. Below is an example that maintains the first 10000 values of g and the last 20000 values of g. It works at least for n <= 2000000, and it stops working for sure at n >= 2500000. For n == 2000000, it takes about 5 to 10 seconds to compute.
#include <iostream>
#include <cstdlib>
#include <map>
#include <vector>
using namespace std;

class Golomb_Generator {
public:
    int next() {
        return g(n++);
    }
private:
    int n = 1;
    map<int, int> higherValues{};
    vector<int> lowerValues{1, 1};
    int g(int n) {
        if (n == 1)
            return 1;
        if (n <= 10000) {
            lowerValues.push_back(1 + lowerValues[n - lowerValues[lowerValues[n - 1]]]);
            return higherValues[n] = lowerValues[n];
        }
        removeOldestResults();
        return higherValues[n] = 1 + higherValues[n - lowerValues[higherValues[n - 1]]];
    }
    void removeOldestResults() {
        while (higherValues.size() >= 20000)
            higherValues.erase(higherValues.begin());
    }
};

void printGolomb(int n)
{
    Golomb_Generator g{};
    for (int i = 0; i < n - 1; ++i)
        g.next();
    cout << g.next() << endl;
}

int main()
{
    int n = 2000000;
    printGolomb(n);
    return EXIT_SUCCESS;
}
There are some choices and considerations regarding the runtime.
Move the complexity to math
The algorithm is nothing but math written in a programming language, so it may be improved by mathematical substitutions. You may look into the research on this sequence and find a better algorithm to substitute.
Move the complexity to the compiler
When calling findGolomb(12) with a specific number known at compile time, we may use constexpr to move the calculation to compile time:
constexpr int findGolomb(int);
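For illustration, a minimal sketch of this idea: the same recurrence marked constexpr, with a static_assert forcing the compiler to evaluate it (6 is the 12th Golomb number).

constexpr int findGolomb(int n)
{
    // evaluated entirely by the compiler when n is a compile-time constant
    return n == 1 ? 1 : 1 + findGolomb(n - findGolomb(findGolomb(n - 1)));
}

static_assert(findGolomb(12) == 6, "computed at compile time");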
move the complexity to the memory
Although requested by the Question to not use an array, this is a considerable constraint. Without using any additional memory space, the algorithm has no options but to use runtime, to for example, to known already computed values of findGolomb(..).
The memory constraint may also include the size of the compiled program (by additional lines of code).
Move the complexity to the runtime
When not using math, the compiler, or memory to improve the algorithm, there is no option left but to pay the complexity at runtime.
Summarizing: there is no way to improve the runtime other than through the four options above. Ruling out compiler and memory optimizations, and considering the current runtime already optimal, you are left only with math and research.
Related
I attempted to make my own sorting algorithm (call it MySort for now) and benchmark it against QuickSort. I use a random number generator to make an input file containing n random numbers, provide this file as input to both MySort and QuickSort, and use std::chrono to time each of them individually.
(At first I used an online compiler to check the times, but when I hit its limit of 10000 characters of input, I switched to running everything myself on my PC.)
For the first few tries (100, 1000, 10000, and 100000 elements), everything works fine: I get a proper timing for each sorting algorithm. But when I try 1000000 elements, QuickSort just doesn't give any output (it does not seem to work at all), which is strange, because MySort works just fine. I don't think it is a space issue, since MySort uses 2n additional space and works fine.
The implementation of QuickSort I am using is given below:
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;

void quick_sort(int[], int, int);
int partition(int[], int, int);

int main()
{
    int n, i;
    cin >> n;
    int a[n];
    for (i = 0; i < n; i++)
        cin >> a[i];
    auto start = high_resolution_clock::now();
    quick_sort(a, 0, n - 1);
    auto stop = high_resolution_clock::now();
    duration<double, micro> d = stop - start;
    cout << "Time taken = " << d.count() << endl;
    /*
    cout << "\nArray after sorting:";
    for (i = 0; i < n; i++)
        cout << a[i] << endl;
    */
    return 0;
}

void quick_sort(int a[], int l, int u)
{
    int j;
    if (l < u)
    {
        j = partition(a, l, u);
        quick_sort(a, l, j - 1);
        quick_sort(a, j + 1, u);
    }
}

int partition(int a[], int l, int u)
{
    int v, i, j, temp;
    v = a[l];
    i = l;
    j = u + 1;
    do
    {
        do
            i++;
        while (a[i] < v && i <= u);
        do
            j--;
        while (v < a[j]);
        if (i < j)
        {
            temp = a[i];
            a[i] = a[j];
            a[j] = temp;
        }
    } while (i < j);
    a[l] = a[j];
    a[j] = v;
    return j;
}
I tried looking around for why it refuses to work for a million elements, but found nothing besides the possibility that it is a space issue, which seems unlikely to me considering MySort works.
As for what exactly I get as output on feeding 1000000 elements in, when I execute both files on the command line, the output I get is (both run twice):
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 512129
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 516131
C:\Users\Zac\Desktop>QuickSortTest <output.txt
C:\Users\Zac\Desktop>QuickSortTest <output.txt
C:\Users\Zac\Desktop>
However, if I run them both for only 100000 elements each, this is what I get:
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 76897.1
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 74019.4
C:\Users\Zac\Desktop>QuickSortTest <output.txt
Time taken = 16880.2
C:\Users\Zac\Desktop>QuickSortTest <output.txt
Time taken = 18005.3
C:\Users\Zac\Desktop>
Seems to be working fine.
I am at my wits' end; any suggestions would be wonderful.
cin>>n;
int a[n];
This is your bug. You should never do this, for three reasons.
This is not valid C++. The dimension of an array must be a constant expression; you are being fooled by a non-conforming extension of gcc, and your code will fail to compile with other compilers. You should always run gcc (and clang) in high-conformance mode; for C++ that would be g++ -std=c++17 -Wall -pedantic-errors.
A large array local to a function is likely to provoke a stack overflow, since local variables are normally allocated on the stack and stack memory is usually very limited.
C-style arrays are bad, mkay? They don't know their own size, they cannot easily be checked for out-of-bounds access (std::vector and std::array have the bounds-checking at() member function), and they cannot be assigned, passed to functions by value, or returned from functions. Use std::vector instead (or std::array when the size is known in advance); see the sketch below.
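As a quick sketch of that last point, the conforming replacement for the VLA reads the size and then sizes a vector at run time:

#include <iostream>
#include <vector>

int main()
{
    int n;
    std::cin >> n;
    std::vector<int> a(n); // heap-allocated; valid C++, no VLA, no stack overflow
    for (int& v : a)
        std::cin >> v;
}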
Let's remove the VLA you're using and switch to std::vector. Here is what the code looks like with sample data of 10 items (but with a check for boundary conditions).
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
using namespace std::chrono;

using vint = std::vector<int>;

void quick_sort(vint&, int, int);
int partition(vint&, int, int);

int main()
{
    int n = 10, i;
    vint a = { 7, 43, 2, 1, 6, 34, 987, 23, 0, 6 };
    auto start = high_resolution_clock::now();
    quick_sort(a, 0, n - 1);
    auto stop = high_resolution_clock::now();
    duration<double, micro> d = stop - start;
    cout << "Time taken = " << d.count() << endl;
    return 0;
}

void quick_sort(vint& a, int l, int u)
{
    int j;
    if (l < u)
    {
        j = partition(a, l, u);
        quick_sort(a, l, j - 1);
        quick_sort(a, j + 1, u);
    }
}

int partition(vint& a, int l, int u)
{
    int v, i, j, temp;
    v = a[l];
    i = l;
    j = u + 1;
    do
    {
        do
            i++;
        while (a.at(i) < v && i <= u);
        do
            j--;
        while (v < a[j]);
        if (i < j)
        {
            temp = a[i];
            a[i] = a[j];
            a[j] = temp;
        }
    } while (i < j);
    a[l] = a[j];
    a[j] = v;
    return j;
}
Running this, you will see that a std::out_of_range exception is thrown on the line with the std::vector::at() call.
Bottom line: your code was flawed to begin with, whether it was 10, 100, or a million items. You are going out of bounds, and thus the behavior is undefined. Using std::vector and at() detected the error, something that VLAs will not give you.
Besides the VLA, your quicksort always chooses the first element as the pivot. This can perform badly in the worst cases. I don't know your output.txt, but if the array is already sorted, the sort runs in O(n^2) time (and recurses n levels deep, which can overflow the stack), because every partition splits off one element and the rest (half and half is the best case). I think this is why it gives no output for big inputs.
So I would suggest a couple of commonly used pivot-choosing heuristics:
Choose it randomly
Choose the median of 3 elements: lowest/middle/highest index (a[l] / a[(l+u)/2] / a[u])
Once you choose a pivot, you can simply swap it with a[l], which minimizes your code changes; see the sketch below.
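A minimal sketch of the median-of-three choice, assuming the array-and-index interface of your partition (the helper name is mine):

#include <utility> // std::swap

// Put the median of a[l], a[(l+u)/2], a[u] into a[l], so partition()
// can keep using a[l] as the pivot unchanged.
void choosePivot(int a[], int l, int u)
{
    int m = l + (u - l) / 2;
    if (a[m] < a[l]) std::swap(a[m], a[l]);
    if (a[u] < a[l]) std::swap(a[u], a[l]);
    if (a[u] < a[m]) std::swap(a[u], a[m]); // median is now at a[m]
    std::swap(a[l], a[m]);                  // move it into a[l]
}

Calling choosePivot(a, l, u) as the first statement of partition() is the only change needed.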
Please read the question here: http://www.spoj.com/problems/MRECAMAN/
The question is to compute Recaman's sequence, where a(0) = 0 and a(i) = a(i-1) - i if a(i-1) - i > 0 and has not appeared in the sequence before; otherwise a(i) = a(i-1) + i.
When I use a vector to store the sequence and the find function, the program times out. But when I use an array plus a set to check whether an element exists, it gets accepted (and is very fast). Is using a set faster?
Here are the codes:
Vector implementation
vector<int> sequence;
sequence.push_back(0);
int a, b;
for (int i = 1; i <= 500000; i++)
{
    a = sequence[i - 1] - i;
    b = sequence[i - 1] + i;
    if (a > 0 && find(sequence.begin(), sequence.end(), a) == sequence.end())
        sequence.push_back(a);
    else
        sequence.push_back(b);
}
Set Implementation
const int MAXN = 500000;
int a[MAXN + 1];
set<int> exists;
a[0] = 0;
for (int i = 1; i <= MAXN; ++i)
{
    if (a[i - 1] - i > 0 && exists.find(a[i - 1] - i) == exists.end())
        a[i] = a[i - 1] - i;
    else
        a[i] = a[i - 1] + i;
    exists.insert(a[i]);
}
Lookup in an std::vector:
find(sequence.begin(), sequence.end(), a) == sequence.end()
is an O(n) operation (n being the number of elements in the vector).
Lookup in an std::set (which is a balanced binary search tree):
exists.find(a[i-1] - i) == exists.end()
is an O(log n) operation.
So yes, lookup in a set is (asymptotically) faster than a linear lookup in vector.
If you can keep the vector sorted, lookup in it is faster in most cases than in a set, because it is much more cache friendly.
There is only one valid answer to most "Is XY faster than UV in C++" questions:
Use a profiler.
While most algorithms (including container insertions, searches, etc.) have guaranteed complexity, those complexities only tell you about the approximate behavior for large amounts of data. The performance for a given smaller data set cannot be compared so easily, and the optimizations a compiler applies cannot reasonably be guessed by humans. So use a profiler and see what is faster, if it matters at all. To see whether performance matters in that particular part of your program, use a profiler.
However, in your case it is a safe bet that searching a set of ~250k elements is faster than a linear search through an unsorted vector of that size. And if you use the vector only for storing the inserted values and keep sequence[i-1] in a separate variable, you can keep the vector sorted and use an algorithm for sorted ranges like binary_search, which can be even faster than the set.
A sample implementation with a sorted vector:
static const int NMAX = 500000;
vector<int> values = {0};
values.reserve(NMAX + 1);
int lastInserted = 0;
for (int i = 1; i <= NMAX; ++i) {
    int a = lastInserted - i;
    int b = lastInserted + i;
    auto iter = lower_bound(begin(values), end(values), a);
    // a is always less than the last inserted value, so iter can't be end(values)
    if (a > 0 && a < *iter) {
        lastInserted = a;
    }
    else {
        // b > a => lower_bound(b) >= lower_bound(a)
        iter = lower_bound(iter, end(values), b);
        lastInserted = b;
    }
    values.insert(iter, lastInserted);
}
I hope I did not introduce any bugs...
For the task at hand, set is faster than vector because it keeps its contents sorted and does a binary search to find a specified item, giving logarithmic complexity instead of linear complexity. When the set is small, that difference is also small, but when the set gets large the difference grows considerably. I think you can improve things a bit more than just that though.
First, I'd avoid the clumsy lookup to see if an item is already present by just attempting to insert an item, then see if that succeeded:
if (b > 0 && exists.insert(b).second)
    a[i] = b;
else {
    a[i] = c;
    exists.insert(c);
}
This avoids looking up the same item twice: once to see whether it is already present, and again to insert it. A second insertion only happens when the first value was already present, meaning we are going to insert the other value instead.
Second, and even more importantly, you can use std::unordered_set to improve the complexity from logarithmic to (expected) constant. Since unordered_set has (mostly) the same interface as std::set, this substitution is easy to make (including the optimization above).
Here's some code to compare the three methods:
#include <iostream>
#include <string>
#include <set>
#include <unordered_set>
#include <vector>
#include <numeric>
#include <chrono>

static const int MAXN = 500000;

unsigned original() {
    static int a[MAXN + 1];
    std::set<int> exists;
    a[0] = 0;
    for (int i = 1; i <= MAXN; ++i)
    {
        if (a[i - 1] - i > 0 && exists.find(a[i - 1] - i) == exists.end())
            a[i] = a[i - 1] - i;
        else
            a[i] = a[i - 1] + i;
        exists.insert(a[i]);
    }
    return std::accumulate(std::begin(a), std::end(a), 0U);
}

template <class container>
unsigned reduced_lookup() {
    container exists;
    std::vector<int> a(MAXN + 1);
    a[0] = 0;
    for (int i = 1; i <= MAXN; ++i) {
        int b = a[i - 1] - i;
        int c = a[i - 1] + i;
        if (b > 0 && exists.insert(b).second)
            a[i] = b;
        else {
            a[i] = c;
            exists.insert(c);
        }
    }
    return std::accumulate(std::begin(a), std::end(a), 0U);
}

template <class F>
void timer(F f) {
    auto start = std::chrono::high_resolution_clock::now();
    std::cout << f() << "\t";
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << "Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() << " ms\n";
}

int main() {
    timer(original);
    timer(reduced_lookup<std::set<int>>);
    timer(reduced_lookup<std::unordered_set<int>>);
}
Note how std::set and std::unordered_set provide similar enough interfaces that I've written the code as a single template that can use either type of container, then for timing just instantiated that for both set and unordered_set.
Anyway, here's some results from g++ (version 4.8.1, compiled with -O3):
212972756 Time: 137 ms
212972756 Time: 101 ms
212972756 Time: 63 ms
Changing the lookup strategy improves speed by about 30% [1], and using unordered_set with the improved lookup strategy better than doubles the speed compared to the original; not bad, especially when the result actually looks cleaner, at least to me. You might not agree that it's cleaner looking, but I think we can at least agree that I didn't write code that was a lot longer or more complex to get the speed improvement.
[1] Simplistic analysis indicates that it should be around 25%. Specifically, if we assume even odds of a given number being in the set already, this eliminates half the lookups about half the time, or about a quarter of the lookups.
The set is a huge speedup because it's faster to look up. (Btw, exists.count(a) == 0 is prettier than using find.)
That doesn't have anything to do with vector vs. array, though; adding the set to the vector version should work just as well, as sketched below.
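A sketch of that combination, grounded in the code from the question (the wrapper function name is mine):

#include <set>
#include <vector>
using namespace std;

// The original vector version, keeping an auxiliary set purely for the
// fast membership test; the vector still stores the sequence itself.
vector<int> recaman(int maxn)
{
    vector<int> sequence{0};
    set<int> exists{0};
    for (int i = 1; i <= maxn; i++) {
        int a = sequence[i - 1] - i;
        int b = sequence[i - 1] + i;
        int next = (a > 0 && exists.find(a) == exists.end()) ? a : b;
        sequence.push_back(next);
        exists.insert(next);
    }
    return sequence;
}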
It is a classic space-time trade-off. When you use only the vector, your program uses the minimum memory, but you have to search the existing numbers on every step, which is slow. When you add an index data structure (the set in your case), you dramatically speed up your code, but it now takes at least twice the memory.
I'm pretty new at C++ and would need some advice on this.
Here I have code that I wrote to count the number of times an arbitrary integer x occurs in an array, outputting the comparisons made along the way.
However, I've read that by using multi-way branching ("divide and conquer!") techniques, I could make the algorithm run faster.
Could anyone point me in the right direction on how to go about that?
Here is my current working code:
#include <iostream>
#include <cstdlib>
#include <vector>
using namespace std;

vector<int> integers;
int function(int vectorsize, int count);
int x;
double input;

int main()
{
    cout << "Enter 20 integers" << endl;
    cout << "Type 0.5 to end" << endl;
    while (true)
    {
        cin >> input;
        if (input == 0.5)
            break;
        integers.push_back(input);
    }
    cout << "Enter the integer x" << endl;
    cin >> x;
    function((integers.size() - 1), 0);
    system("pause");
}

int function(int vectorsize, int count)
{
    if (vectorsize < 0) // termination condition
    {
        cout << "The number of times " << x << " appears is " << count << endl;
        return 0;
    }
    if (integers[vectorsize] > x)
    {
        cout << integers[vectorsize] << " > " << x << endl;
    }
    if (integers[vectorsize] < x)
    {
        cout << integers[vectorsize] << " < " << x << endl;
    }
    if (integers[vectorsize] == x)
    {
        cout << integers[vectorsize] << " = " << x << endl;
        count = count + 1;
    }
    return function(vectorsize - 1, count);
}
Thanks!
If the array is unsorted, just use a single loop to compare each element to x. Unless there's something you're forgetting to tell us, I don't see any need for anything more complicated.
If the array is sorted, there are algorithms (e.g. binary search) that would have better asymptotic complexity. However, for a 20-element array a simple linear search should still be the preferred strategy.
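For the unsorted case, a minimal sketch of that single loop (the function name is mine; the standard library's std::count performs exactly this loop):

#include <algorithm>
#include <vector>

int countOccurrences(const std::vector<int>& v, int x)
{
    // one pass, one comparison per element
    return static_cast<int>(std::count(v.begin(), v.end(), x));
}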
If your array is sorted, you can use a divide and conquer strategy:
Efficient way to count occurrences of a key in a sorted array
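For reference, a sketch of that idea (function name mine): on a sorted vector, std::equal_range does two binary searches and bounds the run of x in O(log n).

#include <algorithm>
#include <vector>

int countSorted(const std::vector<int>& v, int x) // v must be sorted
{
    auto range = std::equal_range(v.begin(), v.end(), x);
    return static_cast<int>(range.second - range.first);
}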
A divide and conquer algorithm is only beneficial if you can either eliminate some work with it, or parallelize the divided work parts across several computation units. In your case, the first option is possible for an already sorted dataset; other answers have addressed that.
For the second option, the algorithm's name is map-reduce: split the dataset into several subsets, distribute the subsets to as many threads or processes, and gather the results to "compile" them (the term is actually "reduce") into the final result. In your setting, each thread scans its own slice of the array to count the items, then returns its result to the "reduce" step, which adds them up to produce the final count. This solution is only interesting for large datasets, though.
There are questions dealing with mapreduce and C++ on SO, but I'll try to give you a sample implementation here:
#include <utility>
#include <thread>
#include <boost/thread/barrier.hpp>

constexpr int MAP_COUNT = 4;

int mresults[MAP_COUNT];
boost::barrier endmap(MAP_COUNT + 1);

void mfunction(int start, int end, int rank) {
    int count = 0;
    for (int i = start; i < end; i++)
        if (integers[i] == x) count++;
    mresults[rank] = count;
    endmap.wait();
}

int rfunction() {
    int count = 0;
    for (int i : mresults)
        count += i;
    return count;
}

int mapreduce() {
    vector<thread> mthreads;
    // note: any remainder elements beyond MAP_COUNT * range are not scanned
    int range = integers.size() / MAP_COUNT;
    for (int i = 0; i < MAP_COUNT; i++)
        mthreads.push_back(thread(mfunction, i * range, (i + 1) * range, i));
    endmap.wait(); // wait until every map thread has stored its count
    for (auto& t : mthreads)
        t.join();
    return rfunction();
}
Once the integers vector has been populated, you call the mapreduce function defined above, which should return the expected result. As you can see, the implementation is very specialized:
the map and reduce functions are specific to your problem,
the number of threads used for map is static,
I followed your style and used global variables,
for convenience, I used a boost::barrier for synchronization
However this should give you an idea of the algorithm, and how you could apply it to similar problems.
caveat: code untested.
There are many problems on the internet that require you to find prime numbers, so I decided to write a set of functions to find them. I used the Sieve of Eratosthenes to generate the primes, as it is fast and easy to implement compared to other algorithms. However, I'm wondering whether my code, rather than my method, is inefficient. Am I using STL containers/iterators right? Is any section of my code slowing the program down?
In other words, it does calculate the results correctly; what I wonder is whether its efficiency can be improved by some algorithmic improvement, as opposed to just code tweaking.
Any help is truly appreciated.
Here's my code (I apologize if it's hard to read):
#include <iostream>
#include <set>
#include <vector>
#include <algorithm>
#include <cmath>
using namespace std;

#define initial_prime_barrier 100

bool isFlagged(int i) { return i == 0; }
bool isNextStart(int i) { return i != 0; }

vector<int> generatePrimesBelow(int limit)
{
    vector<int> primes;
    for (int i = 2; i < limit; i++)
    {
        primes.push_back(i);
    }
    vector<int>::iterator currentStart = primes.begin();
    do
    {
        int numberAtStart = *currentStart;
        vector<int>::iterator currentNumber = currentStart + numberAtStart;
        do
        {
            *currentNumber = 0;
            advance(currentNumber, numberAtStart);
        } while (currentNumber < primes.end());
        currentStart = find_if(currentStart + 1, primes.end(), isNextStart);
    } while ((*currentStart) * (*currentStart) < limit);
    vector<int>::iterator newEnd = remove_if(primes.begin(), primes.end(), isFlagged);
    primes.erase(newEnd, primes.end());
    return primes;
}

bool isPrime(int number)
{
    static vector<int> primes = generatePrimesBelow(initial_prime_barrier);
    static int numPrimes = primes.size();
    static int largestPrime = primes[numPrimes - 1];
    static int halfwayPrime = primes[numPrimes / 2];
    if (number == largestPrime)
    {
        return true;
    }
    else if (number < largestPrime)
    {
        if (number == halfwayPrime)
        {
            return true;
        }
        else if (number > halfwayPrime)
        {
            for (int i = numPrimes / 2; i < numPrimes; i++)
            {
                if (number == primes[i])
                {
                    return true;
                }
            }
        }
        else if (number < halfwayPrime)
        {
            for (int i = numPrimes / 2; i >= 0; i--)
            {
                if (number == primes[i])
                {
                    return true;
                }
            }
        }
    }
    else if (number > largestPrime)
    {
        primes = generatePrimesBelow(number + number);
        numPrimes = primes.size();
        largestPrime = primes[numPrimes - 1];
        halfwayPrime = primes[numPrimes / 2];
        return isPrime(number);
    }
    return false;
}

int main(int argc, char* const argv[])
{
    const int number = 123123;
    cout << (isPrime(number) ? "YES" : "NO") << endl;
}
Yes, it is your method. Several things. You don't need your array to hold the numbers themselves: each entry's position in the array is the number. You just need each entry to hold one of two values, true or false, so make your array a vector<bool>; it will be much more compact. Then, in your inner loop you start from x + x and advance in steps of x. You should start from x * x and advance in steps of 2 * x; that works for every x except 2, so make 2 a special case, or mark the even numbers in the initialization loop. Or treat the entry at index i as representing the number 2 * i + 1 and dispense with handling evens altogether. This should speed up your sieve code. Lastly, you don't need the special find_if call with all its machinery; you can just test the current entry as it comes up in the loop.
(edit:) In your isPrime you perform a search by hand, but there is already a binary_search algorithm in the STL. And you won't need it at all if you keep your vector<bool> sieve array as is, without compressing it: isPrime(i) then just checks whether the array's value at index i is still true.
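A minimal sketch of those suggestions combined (function name mine; assumes limit >= 2, with even numbers handled by the i == 2 pass):

#include <vector>

std::vector<bool> sieveUpTo(int limit)
{
    std::vector<bool> isPrime(limit, true); // index == number
    isPrime[0] = isPrime[1] = false;
    for (long long i = 2; i * i < limit; ++i)
        if (isPrime[i])
            // start at i * i; stepping 2 * i for odd i skips the even multiples
            for (long long j = i * i; j < limit; j += (i == 2 ? i : 2 * i))
                isPrime[j] = false;
    return isPrime; // the primality test is then just a lookup: sieve[n]
}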
(edit2:) Now, about efficiency. You recalculate up to n + n, probably in anticipation of more numbers to test. If you only test a few, simple trial division by odd numbers will be faster. If the numbers to test all lie in a narrow-ish upper region, your best option is an offset sieve, with the lower sieve done up to the square root of the test region's upper limit. And if the numbers are widely distributed, your current whole-array approach can be used.
The key facts to use here are that there are approximately n ~= m / log m primes below m in value, that sieving an array from 0 to m takes O(m * log(log m)) time, and that sieving the upper region between a and b, of width d = b - a, by all the primes below r = sqrt(b) takes time proportional to d * log(log r).
Also, when growing your sieve array it is best to extend it, not to recalculate the whole thing anew; the primes are all already there. To sieve the appendage you loop through all the primes in the sieve array up to the square root of its new upper edge. This is reminiscent of a segmented sieve, although there each new segment comes instead of, or in any case separately from, the previous one.
I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
Right now I'm doing this:
struct SourcePoint {
    Point point;
    float value;
};

SourcePoint* maxValues = new SourcePoint[M];

for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
        if (sample > maxValues[0].value) {
            int q = 1;
            while (q < M && sample > maxValues[q].value) {
                maxValues[q - 1] = maxValues[q]; // shuffle the values back
                q++;
            }
            maxValues[q - 1].value = sample;
            maxValues[q - 1].point = Point(i, j);
        }
    }
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encountered so far. This gives us a quick and easy bailout: if sample <= maxValues[0].value, we don't do anything. The issue I'm having is the shuffling every time a new, better value is found; it works its way down maxValues until it finds its spot, shuffling all the elements to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm, since it is called many, many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead of creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large: M is typically 10-20 and N 10-30 (NxN is 100-900 elements). The issue is that this operation is called repeatedly, and it can't be precomputed.
I just had a thought: pre-loading the first M elements of maxValues may provide some small savings, since in the current algorithm the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a little bit (not as much as I thought it would).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Looking at how the data is created/stored/organized may give you patterns or information that reduce the time needed to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
    int x;
    int y;
    float value;
    int test; // Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, a simple first step would be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to a power-of-two size, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
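For the first suggestion, a rough sketch of the packing (names are mine; it assumes the coordinates fit in 16 bits, which easily holds for N around 10-30):

struct PackedSourcePoint { // 8 bytes, a power of two
    int xy;
    float value;
};

inline int pack(int x, int y) { return (x << 16) | (y & 0xFFFF); }
inline int unpackX(int xy)    { return xy >> 16; }
inline int unpackY(int xy)    { return xy & 0xFFFF; }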
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think the OP's insertion sort probably has decent real-world performance (at least when the recommendations of the other posters in this thread are taken into account).
Lookup in the case of failure is constant time: if the current element is less than the minimum element of the heap (which holds the M largest so far), we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element), we replace (discard) the previous minimum with the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
#include <limits>
#include <utility>

template<unsigned size, typename T>
class m_heap {
private:
    T nodes[size];
    static unsigned parent(unsigned i) { return (i - 1) / 2; }
    static unsigned left(unsigned i) { return i * 2 + 1; }
    static unsigned right(unsigned i) { return i * 2 + 2; }
    void bubble_down(unsigned i) {
        for (;;) {
            unsigned j = i;
            if (left(i) < size && nodes[left(i)] < nodes[i])
                j = left(i);
            if (right(i) < size && nodes[right(i)] < nodes[j])
                j = right(i);
            if (i != j) {
                std::swap(nodes[i], nodes[j]);
                i = j;
            } else {
                break;
            }
        }
    }
    void bubble_up(unsigned i) {
        while (i > 0 && nodes[i] < nodes[parent(i)]) {
            std::swap(nodes[parent(i)], nodes[i]);
            i = parent(i);
        }
    }
public:
    m_heap() {
        // note: for floating-point T, numeric_limits<T>::min() is the smallest
        // positive value; numeric_limits<T>::lowest() would also admit negatives
        for (unsigned i = 0; i < size; i++)
            nodes[i] = std::numeric_limits<T>::min();
    }
    void add(const T& x) {
        if (x < nodes[0]) {
            // smaller than the minimum of the M best seen so far: reject outright
            return;
        }
        nodes[0] = x;   // replace the minimum...
        bubble_down(0); // ...and restore the heap property
    }
    T* get() { return nodes; } // used by the test below
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
using namespace std;
// INCLUDE TEMPLATED CLASS FROM ABOVE
typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }
int main()
{
int N = 2000;
vf v;
for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX);
static const int M = 50;
m_heap<M, float> h;
for (int i = 0; i < N; i++) h.add( v[i] );
sort(v.begin(), v.end(), compare);
vf heap(h.get(), h.get() + M); // assume public in m_heap: T* get() { return nodes; }
sort(heap.begin(), heap.end(), compare);
cout << "Real\tFake" << endl;
for (int i = 0; i < M; i++) {
cout << v[i] << "\t" << heap[i] << endl;
if (fabs(v[i] - heap[i]) > 1e-5) abort();
}
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
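A sketch of the single-queue variant, assuming a SourcePoint struct like the one in the question (the comparator keeps the smallest of the M best on top, so it is cheap to evict):

#include <cstddef>
#include <queue>
#include <vector>

struct SourcePoint { int x, y; float value; };

struct ByValue {
    bool operator()(const SourcePoint& a, const SourcePoint& b) const {
        return a.value > b.value; // smallest value ends up at top()
    }
};

using TopM = std::priority_queue<SourcePoint, std::vector<SourcePoint>, ByValue>;

void addSample(TopM& q, const SourcePoint& s, std::size_t M)
{
    if (q.size() < M)
        q.push(s);
    else if (s.value > q.top().value) {
        q.pop();   // drop the current smallest of the M best
        q.push(s);
    }
}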
A quick optimization would be to add a sentinel value to your maxValues array: if you set maxValues[M].value to std::numeric_limits<float>::max(), you can eliminate the q < M test in your while loop condition.
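In code, as a fragment patching the question's snippet (reusing its SourcePoint and M):

#include <limits>

SourcePoint* maxValues = new SourcePoint[M + 1];         // one extra slot
maxValues[M].value = std::numeric_limits<float>::max();  // sentinel: never displaced
// the inner loop can then drop the bound check:
// while (sample > maxValues[q].value) { ... }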
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try; if it works well enough, you don't need as much "magic", and in particular you don't have to resort to micro-optimizations.
Consider this showcase:
#include <algorithm>
#include <cstring>
#include <iostream>
#include <vector>
#include <stddef.h>

static const int M = 15;
static const int N = 20;

// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
    Sample(float *arr, size_t row, size_t col)
        : m_arr(arr),
          m_row(row),
          m_col(col)
    {
    }
    inline operator float() const {
        return m_arr[m_row * N + m_col];
    }
    bool operator<(const Sample &rhs) const {
        // reversed on purpose, so partial_sort puts the largest samples first
        return (float)rhs < (float)*this;
    }
    int row() const {
        return m_row;
    }
    int col() const {
        return m_col;
    }
private:
    float *m_arr;
    size_t m_row;
    size_t m_col;
};
int main()
{
    // Set up a demo array
    float arr[N][N];
    memset(arr, 0, sizeof(arr));
    // Put in some sample values
    arr[2][1] = 5.0;
    arr[9][11] = 2.0;
    arr[5][4] = 4.0;
    arr[15][7] = 3.0;
    arr[12][19] = 1.0;
    // Set up the sequence of references into this array; you could keep
    // a copy of this sequence around to reuse it later, I think.
    std::vector<Sample> samples;
    samples.reserve(N * N);
    for (size_t row = 0; row < N; ++row) {
        for (size_t col = 0; col < N; ++col) {
            samples.push_back(Sample((float *)arr, row, col));
        }
    }
    // Let partial_sort find the M largest entries
    std::partial_sort(samples.begin(), samples.begin() + M, samples.end());
    // Print out the row/column of the M largest entries.
    for (std::vector<Sample>::size_type i = 0; i < M; ++i) {
        std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
    }
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
    for (int i = 0; i < cols; i++) {
        float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
    for (int j = 0; j < rows; j++) {
        float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
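A minimal sketch of that bookkeeping (the function name is mine): best holds the M largest samples seen so far as a min-heap, so the smallest of them sits at the front and is cheap to evict.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

void consider(std::vector<float>& best, float sample, std::size_t M)
{
    if (best.size() < M) {
        best.push_back(sample);
        std::push_heap(best.begin(), best.end(), std::greater<float>());
    } else if (sample > best.front()) {
        std::pop_heap(best.begin(), best.end(), std::greater<float>()); // min moves to back
        best.back() = sample;                                           // overwrite it
        std::push_heap(best.begin(), best.end(), std::greater<float>());
    }
}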
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) on each CPU, finding the top M entries in each band, and then do a selection sort over the band results to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.
Use a linked list to store the best M values so far. You'll still have to iterate over it to find the right spot, but the insertion itself is O(1). It would probably even be better than binary search plus insertion: O(N) + O(1) vs. O(lg N) + O(N).
Interchange the fors, so you're not accessing every Nth element in memory and thrashing the cache.
LE: Throwing in another idea that might work for uniformly distributed values; see the sketch below.
Find the min and max in 3/2 * O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix, place it in bucket[(int)((value - min) / range * (bucketCount - 1))], with range = max - min.
Finally, build a set starting from the highest bucket down to the lowest, adding whole buckets while |current set| + |next bucket| <= M.
If you get M elements, you're done.
More likely you'll get fewer elements than M, say P.
Apply your original algorithm to the remaining bucket to get the biggest M - P elements out of it.
If the elements are uniform and you use N^2 buckets, its complexity is about 3.5 * O(N^2), vs. your current solution, which is about O(N^2) * ln(M).
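A rough sketch of those steps (names are mine; it tracks only values, where the real code would carry the (x, y) positions along, e.g. in the SourcePoint struct):

#include <cstddef>
#include <vector>

std::vector<float> takeWholeTopBuckets(const std::vector<float>& values,
                                       std::size_t B, std::size_t M)
{
    // Step 1: min and max
    float mn = values[0], mx = values[0];
    for (float v : values) { if (v < mn) mn = v; if (v > mx) mx = v; }

    // Steps 2-3: distribute into B buckets by value
    std::vector<std::vector<float>> buckets(B);
    float range = mx - mn;
    for (float v : values) {
        std::size_t idx = range > 0
            ? static_cast<std::size_t>((v - mn) / range * (B - 1)) : 0;
        buckets[idx].push_back(v);
    }

    // Step 4: take whole buckets from the top while they still fit into M
    std::vector<float> result;
    std::size_t b = B;
    while (b-- > 0 && result.size() + buckets[b].size() <= M)
        result.insert(result.end(), buckets[b].begin(), buckets[b].end());
    // result now holds P <= M values; the biggest M - P values of the first
    // bucket that did not fit whole would be found with the original algorithm
    return result;
}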