Sieve of Eratosthenes has huge 'overdraw' - is Sundaram's better after all?

The standard Sieve of Eratosthenes crosses out most composites multiple times; in fact the only ones that do not get marked more than once are those that are the product of exactly two primes. Naturally, the overdraw increases as the sieve gets bigger.
For an odd sieve (i.e. without the evens) the overdraw hits 100% for n = 3,509,227, with 1,503,868 composites and 1,503,868 crossings-out of already crossed-out numbers. For n = 2^32 the overdraw rises to 134.25% (overdraw 2,610,022,328 vs. pop count 1,944,203,427 = (2^32 / 2) - 203,280,221).
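For anyone who wants to reproduce such counts on a smaller scale, here is a minimal sketch of my own (untested, with a plain std::vector<bool> standing in for a tuned bitmap) that tallies redundant crossings-out in an odds-only Eratosthenes:

#include <cstdint>
#include <cstdio>
#include <vector>

int main ()
{
    const uint32_t max_bit = 1000000;           // bit k represents the odd number 2 * k + 1
    std::vector<bool> bm(max_bit + 1, false);
    uint64_t overdraw = 0, composites = 0;

    for (uint32_t i = 1; 2 * i * (i + 1) <= max_bit; ++i)
    {
        if (bm[i]) continue;                    // 2 * i + 1 is composite, skip it

        uint32_t n = 2 * i + 1;

        for (uint32_t k = 2 * i * (i + 1); k <= max_bit; k += n)   // start at bit for n * n
        {
            if (bm[k]) ++overdraw; else { bm[k] = true; ++composites; }
        }
    }

    std::printf("composites %llu, overdraw %llu (%.2f%%)\n",
                (unsigned long long)composites, (unsigned long long)overdraw,
                100.0 * overdraw / composites);
    return 0;
}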
The Sieve of Sundaram - with another explanation at maths.org - may be a bit smarter about that, if - and only if - the loop limits are computed intelligently. However, that's something which the sources I've seen seem to gloss over as 'optimisation', and it also seems that an unoptimised Sundaram gets beaten by an odds-only Eratosthenes every time.
The interesting thing is that both create exactly the same final bitmap, i.e. one where bit k corresponds to the number (2 * k + 1). So both algorithms must end up setting exactly the same bits; they just have different ways of going about it.
Does someone have hands-on experience with a competitive, tuned Sundaram? Can it beat the old Greek?
I have slimmed the code for my small factor sieve (2^32, an odds-only Greek) and tuned the segment size to 256 KB, which is as optimal on the old Nehalem with its 256 KB L2 as on the newer CPUs (even though the latter are much more forgiving about bigger segments). But now I have hit a brick wall and the bloody sieve still takes 8.5 s to initialise. Loading the sieve from hard disk is not a very attractive option, and multi-threading is difficult to do in a portable manner (since libs like boost tend to put a monkey wrench into portability)...
Can Sundaram shave a few seconds from the startup time?
P.S.: the overdraw as such is not a problem and will be absorbed by the L2 cache. The point is that the standard Eratosthenes seems to do more than twice the necessary work, which indicates that there may be potential for doing less work, faster.

Since there weren't any takers for the 'Sundaram vs. Eratosthenes' problem, I sat down and analysed it. Result: classic Sundaram's has strictly higher overdraw than an odds-only Eratosthenes; if you apply an obvious, small optimisation then the overdraw is exactly the same - for reasons that shall become obvious. If you fix Sundaram's to avoid overdraw entirely then you get something like Pritchard's Sieve, which is massively more complicated.
The exposition of Sundaram's Sieve in Lucky's Notes is probably the best by far; slightly rewritten to use a hypothetical (i.e. non-standard and not supplied here) type bitmap_t it looks somewhat like this. In order to measure overdraw the bitmap type needs an operation corresponding to the BTS (bit test and set) CPU instruction, which is available via the _bittestandset() intrinsic with Wintel compilers and with MinGW editions of gcc. The intrinsics are very bad for performance but very convenient for counting overdraw.
Note: for sieving all primes up to N one would call the sieve with max_bit = N/2; if bit i of the resulting bitmap is set then the number (2 * i + 1) is composite. The function has '31' in its name because the index math breaks for bitmaps greater than 2^31; hence this code can only sieve numbers up to 2^32-1 (corresponding to max_bit <= 2^31-1).
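Since bitmap_t is hypothetical, here is a minimal stand-in of my own devising - an assumption, not code from any of the cited sources - providing just the set_all(), bt() and bts() operations used below (intrinsics deliberately left out):

#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal stand-in for the hypothetical bitmap_t; construct with the highest bit index used.
struct bitmap_t
{
    std::vector<uint32_t> words;

    explicit bitmap_t (uint32_t max_bit) : words((max_bit >> 5) + 1) {}

    void set_all (int value)
    {
        std::fill(words.begin(), words.end(), value ? ~uint32_t(0) : uint32_t(0));
    }

    bool bt (uint32_t k) const                  // bit test
    {
        return (words[k >> 5] >> (k & 31)) & 1;
    }

    bool bts (uint32_t k)                       // bit test and set; returns the old value
    {
        uint32_t mask = uint32_t(1) << (k & 31);
        bool old = (words[k >> 5] & mask) != 0;
        words[k >> 5] |= mask;
        return old;
    }
};

With that in place, the straight transcription looks like this: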
uint64_t Sundaram31_a (bitmap_t &bm, uint32_t max_bit)
{
    assert( max_bit <= UINT32_MAX / 2 );

    uint32_t m = max_bit;
    uint64_t overdraw = 0;

    bm.set_all(0);

    for (uint32_t i = 1; i < m / 2; ++i)
    {
        for (uint32_t j = i; j <= (m - i) / (2 * i + 1); ++j)
        {
            uint32_t k = i + j + 2 * i * j;

            overdraw += bm.bts(k);
        }
    }

    return overdraw;
}
Lucky's bound on j is exact but the one on i is very loose. Tightening it and losing the m alias that I had added to make the code look more like common expositions on the net, we get:
uint64_t Sundaram31_b (bitmap_t &bm, uint32_t max_bit)
{
    uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
    uint64_t overdraw = 0;

    bm.set_all(0);

    for (uint32_t i = 1; i <= i_max; ++i)
    {
        for (uint32_t j = i; j <= (max_bit - i) / (2 * i + 1); ++j)
        {
            uint32_t k = i + j + 2 * i * j;

            overdraw += bm.bts(k);
        }
    }

    return overdraw;
}
The assert was dumped in order to reduce noise but it is actually still valid and necessary. Now it's time for a bit of strength reduction, turning multiplication into iterated addition:
uint64_t Sundaram31_c (bitmap_t &bm, uint32_t max_bit)
{
    uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
    uint64_t overdraw = 0;

    bm.set_all(0);

    for (uint32_t i = 1; i <= i_max; ++i)
    {
        uint32_t n = 2 * i + 1;
        uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max
        uint32_t j_max = (max_bit - i) / n;

        for (uint32_t j = i; j <= j_max; ++j, k += n)
        {
            overdraw += bm.bts(k);
        }
    }

    return overdraw;
}
Transforming the loop condition to use k allows us to lose j; things should be looking exceedingly familiar by now...
uint64_t Sundaram31_d (bitmap_t &bm, uint32_t max_bit)
{
    uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
    uint64_t overdraw = 0;

    bm.set_all(0);

    for (uint32_t i = 1; i <= i_max; ++i)
    {
        uint32_t n = 2 * i + 1;
        uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max

        for ( ; k <= max_bit; k += n)
        {
            overdraw += bm.bts(k);
        }
    }

    return overdraw;
}
With things looking the way they do, it's time to analyse whether a certain obvious small change is warranted by the math. The proof is left as an exercise for the reader...
uint64_t Sundaram31_e (bitmap_t &bm, uint32_t max_bit)
{
    uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
    uint64_t overdraw = 0;

    bm.set_all(0);

    for (uint32_t i = 1; i <= i_max; ++i)
    {
        if (bm.bt(i)) continue;   // 2 * i + 1 is composite, skip it

        uint32_t n = 2 * i + 1;
        uint32_t k = n * i + i;   // <= max_bit because we computed i_max to get this bound

        for ( ; k <= max_bit; k += n)
        {
            overdraw += bm.bts(k);
        }
    }

    return overdraw;
}
The only thing that still differs from classic odds-only Eratosthenes (apart from the name) is the initial value for k, which is normally (n * n) / 2 for the old Greek. However, substituting 2 * i + 1 for n gives n * n / 2 = (4 * i * i + 4 * i + 1) / 2 = 2 * i * i + 2 * i + 1/2, and the fractional 1/2 truncates away, leaving exactly Sundaram's starting index n * i + i = 2 * i * i + 2 * i. Hence, Sundaram's is odds-only Eratosthenes without the 'optimisation' of skipping composites to avoid at least some crossings-out of already crossed-out numbers. The value for i_max is the same as the Greek's max_factor_bit, just arrived at via completely different logical steps and computed using a marginally different formula.
P.S.: after seeing overdraw in the code so many times, people will probably want to know what it actually is... Sieving the numbers up to 2^32-1 (i.e. a full 2^31 bit bitmap) Sundaram's has an overdraw of 8,643,678,027 (roughly 2 * 2^32) or 444.6%; with the small fix that turns it into odds-only Eratosthenes the overdraw becomes 2,610,022,328 or 134.2%.

The sieve that avoids calculating redundant composites is the sieve of Marouane; check it out, it may be faster than the Sundaram sieve.

Related

How to further optimize this code using OpenMP multithreading

I have this code snippet I came across and I'm trying to use OpenMP to make it run faster than the original version. However, it seems to be taking about the same amount of time as the older version; I'm not sure why this multithreading approach is not working to optimize it. The timings are still the same. What can I do to make it run even faster?
void sobel(unsigned char *data_out,
           unsigned char *data_in, unsigned height,
           unsigned width)
{
    /* Sobel matrices for convolution */
    int sobelv[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
    int sobelh[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    unsigned int size, i, j;
    int lay;
    size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(64) shared(data_in, data_out, sobelv, sobelh, size) private(i, j, lay)
#endif
    for (lay = 0; lay < 3; lay++) {
        for (i = 1; i < height - 1; ++i) {
            for (j = 1; j < width - 1; j++) {
                int sumh, sumv;
                int k = -1, l = -1;
                sumh = 0;
                sumv = 0;
                /* Convolution part */
                for (k = -1; k < 2; k++)
                    for (l = -1; l < 2; l++) {
                        sumh = sumh + sobelh[k + 1][l + 1] * (int) data_in[lay * size + (i + k) * width + (j + l)];
                        sumv = sumv + sobelv[k + 1][l + 1] * (int) data_in[lay * size + (i + k) * width + (j + l)];
                    }
                int temp = abs(sumh / 8) + abs(sumv / 8);
                data_out[lay * size + i * width + j] =
                    (temp > 255 ? 255 : temp);
            }
        }
    }
}
the main function is simply calling this function like this:
sobel(data_out, data_in, header.height, header.width);
any help would be appreciated!! :)
The best optimization you can apply is to vectorize the code. Compilers can often auto-vectorize code when it is sufficiently simple, but this one is too complex for most compilers (including GCC and Clang) to vectorize.
Manual code vectorization is cumbersome, error-prone and often makes the code (more) dependent on a specific architecture (e.g. x86-64). However, you can help the compiler to generate vectorized code for you. To do that, it is better to:
avoid mixing signed/unsigned types and type of different size;
use the smallest possible types fitting your needs;
avoid loops and conditions in the vectorized loop;
access data contiguously;
avoid integer multiplication/division with small types (on x86-64 and/or with some compilers);
prefer using local short-scoped variables when this is possible;
enable advanced optimizations like -O3 for GCC/Clang, possibly coupled with -mavx2 if your target platform supports the AVX-2 instruction set, or with -march=native if your target platform is the one where the program is built;
be careful about aliasing (possibly using temporary arrays, strict aliasing rules, memcpy calls, restrict compiler extensions, etc.) [thanks to @Laci].
You can check the generated assembly code to see if the code is vectorized or not.
Moreover, using collapse(2) should be enough here to get a good speed-up. collapse(3) can introduce some unwanted overhead due to the last loop being shared amongst threads. collapse(64) is not correct: the argument cannot be bigger than the number of nested loops, which is 3 here.
Here is the resulting untested code:
#include <cmath>
#include <cstdlib>  // abs(int)

void sobel(unsigned char *data_out,
           unsigned char *data_in, int height,
           int width)
{
    const int size = height * width;
#ifdef OPENMP
    #pragma omp parallel for collapse(2) shared(data_in, data_out, size)
#endif
    for (int lay = 0; lay < 3; lay++)
    {
        for (int i = 1; i < height - 1; ++i)
        {
            for (int j = 1; j < width - 1; j++)
            {
                // Load the 3x3 neighbourhood (the coefficients for a22 are zero
                // in both kernels, so it is never needed).
                short a11 = data_in[lay * size + (i-1) * width + (j-1)];
                short a12 = data_in[lay * size + (i-1) * width + j];
                short a13 = data_in[lay * size + (i-1) * width + (j+1)];
                short a21 = data_in[lay * size + i * width + (j-1)];
                short a23 = data_in[lay * size + i * width + (j+1)];
                short a31 = data_in[lay * size + (i+1) * width + (j-1)];
                short a32 = data_in[lay * size + (i+1) * width + j];
                short a33 = data_in[lay * size + (i+1) * width + (j+1)];
                short sumh = a13 - a11 + (a23 - a21) + (a23 - a21) + a33 - a31;
                short sumv = a31 + a32 + a32 + a33 - (a11 + a12 + a12 + a13);
                short temp = (abs(sumh) >> 3) + (abs(sumv) >> 3);
                data_out[lay * size + i * width + j] = (temp > 255 ? 255 : temp);
            }
        }
    }
}
I expect the code to be several times faster (especially true in sequential) -- typically about 10 times faster with AVX-2, since the processor can work on 16 short values at once (despite a bit more work related to SIMD instructions).
Another possible optimization you can do is called register blocking. The idea is to change the loop so that you work on small fixed-size tiles (e.g. 2x2 or 4x2 SIMD values). This should reduce the number of L1-cache loads and the number of char-to-short/short-to-char conversions to perform. However, it is hard to get the compiler to do this optimization correctly on such code. It is probably better to use SIMD intrinsics if performance is critical and do the register blocking yourself.
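To make the register blocking idea concrete, here is a rough, untested scalar sketch of my own (the sobel_mag helper and the row-pointer interface are hypothetical, not part of the code above): two adjacent outputs are computed per iteration, so the overlapping column loads are reused from registers rather than re-fetched from the L1 cache.

#include <cstdlib>

// Hypothetical helper: Sobel magnitude from a 3x3 neighbourhood held in registers.
static inline unsigned char sobel_mag(short a11, short a12, short a13,
                                      short a21,            short a23,
                                      short a31, short a32, short a33)
{
    short sumh = a13 - a11 + 2 * (a23 - a21) + a33 - a31;
    short sumv = a31 + 2 * a32 + a33 - (a11 + 2 * a12 + a13);
    int temp = (abs(sumh) >> 3) + (abs(sumv) >> 3);
    return (unsigned char)(temp > 255 ? 255 : temp);
}

// Register-blocked inner loop over one row of one layer: two outputs per step,
// four columns loaded, the two middle columns reused for both outputs.
void sobel_row_blocked(unsigned char* out, const unsigned char* r0,
                       const unsigned char* r1, const unsigned char* r2, int width)
{
    int j = 1;
    for ( ; j + 1 < width - 1; j += 2) {
        short c0r0 = r0[j-1], c0r1 = r1[j-1], c0r2 = r2[j-1];
        short c1r0 = r0[j],   c1r1 = r1[j],   c1r2 = r2[j];
        short c2r0 = r0[j+1], c2r1 = r1[j+1], c2r2 = r2[j+1];
        short c3r0 = r0[j+2], c3r1 = r1[j+2], c3r2 = r2[j+2];
        out[j]     = sobel_mag(c0r0, c1r0, c2r0, c0r1, c2r1, c0r2, c1r2, c2r2);
        out[j + 1] = sobel_mag(c1r0, c2r0, c3r0, c1r1, c3r1, c1r2, c2r2, c3r2);
    }
    for ( ; j < width - 1; ++j)   // tail for odd interior widths
        out[j] = sobel_mag(r0[j-1], r0[j], r0[j+1], r1[j-1], r1[j+1],
                           r2[j-1], r2[j], r2[j+1]);
}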

Need help understanding this line in an FFT algorithm

In my program I have a function that performs the fast Fourier transform. I know there are very good implementations freely available, but this is a learning thing so I don't want to use those. I ended up finding this comment with the following implementation (it originated from the Italian entry for the FFT):
void transform(complex<double>* f, int N)
{
    ordina(f, N);    // first: sort f into bit-reversed order
    complex<double>* W;
    W = (complex<double> *)malloc(N / 2 * sizeof(complex<double>));
    W[1] = polar(1., -2. * M_PI / N);
    W[0] = 1;
    for (int i = 2; i < N / 2; i++)
        W[i] = pow(W[1], i);
    int n = 1;
    int a = N / 2;
    for (int j = 0; j < log2(N); j++) {
        for (int k = 0; k < N; k++) {
            if (!(k & n)) {
                complex<double> temp = f[k];
                complex<double> Temp = W[(k * a) % (n * a)] * f[k + n];
                f[k] = temp + Temp;
                f[k + n] = temp - Temp;
            }
        }
        n *= 2;
        a = a / 2;
    }
    free(W);
}
I've made a lot of changes by now but this was my starting point. One of the changes I made was to not cache the twiddle factors, because I decided to see if it's needed first. Now I've decided I do want to cache them. The way this implementation seems to do it is it has this array W of length N/2, where every index k has the value W_N^k = e^(-2πik/N). What I don't understand is this expression:
W[(k * a) % (n * a)]
Note that n * a is always equal to N/2. I get that this is supposed to be equal to W_N^(k*a), and I can see that W_N^(m + N/2) = -W_N^m, which this relies on. I also get that modulo can be used here because the twiddle factors are cyclic. But there's one thing I don't get: this is a length-N DFT, and yet only N/2 twiddle factors are ever calculated. Shouldn't the array be of length N, and the modulo should be by N?
But there's one thing I don't get: this is a length-N DFT, and yet only N/2 twiddle factors are ever calculated. Shouldn't the array be of length N, and the modulo should be by N?
The twiddle factors are equally spaced points on the unit circle, and there is an even number of points because N is a power-of-two. After going around half of the circle (starting at 1, going counter clockwise above the X-axis), the second half is a repeat of the first half but this time it's below the X-axis (the points can be reflected through the origin). That is why Temp is subtracted the second time. That subtraction is the negation of the twiddle factor.
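A quick throwaway check (my own snippet, not from the answer) makes the symmetry visible for N = 8:

#include <cmath>
#include <complex>
#include <cstdio>

int main()
{
    const int N = 8;
    const double PI = std::acos(-1.0);
    for (int m = 0; m < N / 2; m++) {
        std::complex<double> w  = std::polar(1.0, -2.0 * PI * m / N);
        std::complex<double> w2 = std::polar(1.0, -2.0 * PI * (m + N / 2) / N);
        // w2 prints as the negation of w: W_N^(m + N/2) = -W_N^m
        std::printf("W^%d = (% .3f, % .3f)   W^%d = (% .3f, % .3f)\n",
                    m, w.real(), w.imag(), m + N / 2, w2.real(), w2.imag());
    }
    return 0;
}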

How to optimize this math operation for speed

I'm trying to optimize a function taking a good chunk of execution time, which computes the following math operation many times. Is there anyway to make this operation faster?
float total = (sqrt(
    ((point_A[j].length) * (point_A[j].length)) +
    ((point_B[j].width)  * (point_B[j].width))  +
    ((point_C[j].height) * (point_C[j].height))
));
If memory is cheap then you could do the following, thereby improving the CPU cache hit rate. Since you haven't posted more details, I will make some assumptions here.
long tmp_len_square[N * 3];
for (int j = 0; j < N; ++j) {
    tmp_len_square[3 * j] = (point_A[j].length) * (point_A[j].length);
}
for (int j = 0; j < N; ++j) {
    tmp_len_square[(3 * j) + 1] = (point_B[j].width) * (point_B[j].width);
}
for (int j = 0; j < N; ++j) {
    tmp_len_square[(3 * j) + 2] = (point_C[j].height) * (point_C[j].height);
}
for (int j = 0; j < N; ++j) {
    float total = sqrt(tmp_len_square[3 * j] +
                       tmp_len_square[(3 * j) + 1] +
                       tmp_len_square[(3 * j) + 2]);
    // ...
}
Rearrange the data into this:
float *pointA_length;
float *pointB_width;
float *pointC_height;
That may require some level of butchering of your data structures, so you'll have to choose whether it's worth it or not.
Now what we can do is write this:
void process_points(float* Alengths, float* Bwidths, float* Cheights,
                    float* output, int n)
{
    for (int i = 0; i < n; i++) {
        output[i] = sqrt(Alengths[i] * Alengths[i] +
                         Bwidths[i] * Bwidths[i] +
                         Cheights[i] * Cheights[i]);
    }
}
Writing it like this allows it to be auto-vectorized. For example, GCC targeting AVX, with -fno-math-errno -ftree-vectorize, can vectorize that loop. It does that with a lot of cruft though; __restrict__ and alignment attributes only improve that a little. So here's a hand-vectorized version as well (not tested):
#include <immintrin.h>

void process_points(float* Alengths,
                    float* Bwidths,
                    float* Cheights,
                    float* output, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_load_ps(Alengths + i);
        __m256 b = _mm256_load_ps(Bwidths + i);
        __m256 c = _mm256_load_ps(Cheights + i);
        __m256 asq = _mm256_mul_ps(a, a);
        __m256 sum = _mm256_fmadd_ps(c, c, _mm256_fmadd_ps(b, b, asq));
        __m256 hsum = _mm256_mul_ps(sum, _mm256_set1_ps(0.5f));
        __m256 invsqrt = _mm256_rsqrt_ps(sum);
        __m256 s = _mm256_mul_ps(invsqrt, invsqrt);
        // Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y)
        invsqrt = _mm256_mul_ps(invsqrt, _mm256_fnmadd_ps(hsum, s, _mm256_set1_ps(1.5f)));
        // x * 1/sqrt(x) = sqrt(x)
        _mm256_store_ps(output + i, _mm256_mul_ps(sum, invsqrt));
    }
}
This makes a number of assumptions:
all pointers are 32-aligned.
n is a multiple of 8, or at least the buffers have enough padding that they're never accessed out of bounds.
the input buffers are not aliased with the output buffer (they could be aliased among themselves, but .. why)
the slightly reduced accuracy of the square root computed this way is OK (accurate to approximately 22 bits, instead of correctly rounded).
the sum of squares computed with fmadd can be slightly different than if it's computed using multiplies and adds, I assume that's OK too
your target supports AVX/FMA so this will actually run
The method for computing the square root I used here is using an approximate reciprocal square root, an improvement step (y = y * (1.5 - (0.5 * x * y * y))) and then a multiplication by x because x * 1/sqrt(x) = x/sqrt(x) = sqrt(x).
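For reference, here is the same refinement written out in scalar form - a sketch of my own, untested, using the SSE scalar intrinsics:

#include <immintrin.h>

float approx_sqrt(float x)
{
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // ~12-bit estimate of 1/sqrt(x)
    y = y * (1.5f - 0.5f * x * y * y);                     // one Newton-Raphson step -> ~22 bits
    return x * y;                                          // x * 1/sqrt(x) = sqrt(x)
}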
You can also try to optimize the sqrt function itself. May I suggest you have a look at this link:
Best Square Root Method
Your question could be improved by adding a little more context. Is your code required to be portable, or are you targeting a particular compiler, or a specific processor or processor family? Perhaps you're willing to accept a general baseline version with target-specific optimised versions selected at runtime?
Also, there's very little context for the line of code you give. Is it in a tight loop? Or is it scattered in a bunch of places in conditional code in such a loop?
I'm going to assume that it's in a tight loop thus:
for (int j = 0; j < total; ++j)
    length[j] = sqrt(
        (point_A[j].length) * (point_A[j].length) +
        (point_B[j].width) * (point_B[j].width) +
        (point_C[j].height) * (point_C[j].height));
I'm also going to assume that your target processor is multi-core, and that the arrays are distinct (or that the relevant elements are distinct); then an easy win is to annotate for OpenMP:
#pragma omp parallel for
for (int j = 0; j < total; ++j)
    length[j] = sqrt((point_A[j].length) * (point_A[j].length) +
                     (point_B[j].width) * (point_B[j].width) +
                     (point_C[j].height) * (point_C[j].height));
Compile with g++ -O3 -fopenmp -march=native (or substitute native with your desired target processor architecture).
If you know your target, you might be able to benefit from parallelisation of loops with the gcc flag -ftree-parallelize-loops=n - look in the manual.
Now measure your performance change (I'm assuming that you measured the original, given that this is an optimisation question). If it's still not fast enough for you, then it's time to consider changing your data structures, algorithms, or individual lines of code.

Faster computation of (approximate) variance needed

I can see with the CPU profiler that compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
                       float* var, size_t* split_dims) {
    for (size_t d = 0; d < points[0].dim(); d++) {
        avg[d] = 0.0;
        var[d] = 0.0;
    }
    float delta, n;
    for (size_t i = 0; i < points.size(); ++i) {
        n = 1.0 + i;
        for (size_t d = 0; d < points[0].dim(); ++d) {
            delta = (points[i][d]) - avg[d];
            avg[d] += delta / n;
            var[d] += delta * ((points[i][d]) - avg[d]);
        }
    }
    /* Find t dimensions with largest scaled variance. */
    kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes the variance in each dimension, using Knuth's online algorithm.
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would be really happy to pay the trade-off between speed and approximate computation of variances. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Notice that before the edit I had used -o3, not with a capital 'O'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure that there was a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
compute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: points[i] is just the i-th element of a std::vector, and the second [] is implemented like this in the Point class:
const FT& operator [](const int i) const {
    if (i < (int) coords.size() && i >= 0)
        return coords.at(i);
    else {
        std::cout << "Error at Point::[]" << std::endl;
        exit(1);
    }
    return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start.) Moreover, std::vector::at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained pretty much the same.
The division in compute_variances() is for sure heavy! However, Knuth's algorithm is numerically stable and I was not able to find another algorithm that is both numerically stable and free of division.
Note that I am not interested in parallelism right now.
[EDIT.2]
Minimal example of the Point class (I don't think I forgot to show anything):
class Point {
public:
    typedef float FT;

    ...

    /**
     * Get dimension of point.
     */
    size_t dim() const {
        return coords.size();
    }

    /**
     * Operator that returns the coordinate at the given index.
     * @param i - index of the coordinate
     * @return the coordinate at index i
     */
    FT& operator [](const int i) {
        return coords.at(i);
        // it's the same if I have the commented code below
        /*if (i < (int) coords.size() && i >= 0)
            return coords.at(i);
        else {
            std::cout << "Error at Point::[]" << std::endl;
            exit(1);
        }
        return coords[0]; // Clear -Wall warning*/
    }

    /**
     * Operator that returns the coordinate at the given index. (constant)
     * @param i - index of the coordinate
     * @return the coordinate at index i
     */
    const FT& operator [](const int i) const {
        return coords.at(i);
        /*if (i < (int) coords.size() && i >= 0)
            return coords.at(i);
        else {
            std::cout << "Error at Point::[]" << std::endl;
            exit(1);
        }
        return coords[0]; // Clear -Wall warning*/
    }

private:
    std::vector<FT> coords;
};
1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE or AVX instructions. Based on your word length and processor you can get speedups of about 4x or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
    delta = (points[i][d]) - avg[d];
    avg[d] += delta / n;
    var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped up by doing the computation for four elements at once with SSE. Since your code processes only one single element per loop iteration and the iterations over d are independent, there is nothing to prevent this. If you go down to 16-bit short instead of 32-bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the whole solution to your performance problem, but just one part that can be combined with others.
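As a hedged illustration of this point (my own sketch, not the answerer's code; it assumes avg and var are contiguous float arrays, p points at the i-th point's contiguous coordinates, and D is a multiple of 4):

#include <xmmintrin.h>

// SSE version of the inner loop: four dimensions per iteration.
void update_stats_sse(const float* p, float* avg, float* var, int D, float inv_n)
{
    __m128 vn = _mm_set1_ps(inv_n);                                   // broadcast 1/n
    for (int d = 0; d < D; d += 4) {
        __m128 x     = _mm_loadu_ps(p + d);
        __m128 vavg  = _mm_loadu_ps(avg + d);
        __m128 vvar  = _mm_loadu_ps(var + d);
        __m128 delta = _mm_sub_ps(x, vavg);
        vavg = _mm_add_ps(vavg, _mm_mul_ps(delta, vn));               // avg += delta / n
        vvar = _mm_add_ps(vvar, _mm_mul_ps(delta, _mm_sub_ps(x, vavg))); // var += delta * (x - avg)
        _mm_storeu_ps(avg + d, vavg);
        _mm_storeu_ps(var + d, vvar);
    }
}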
2. Micro-parallelism
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB; others might suggest OpenMP instead. For this you would probably have to change the loop order: parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quad-core with HT you might get a speed-up of 25-30x for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta and n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The argument to the latter depends on your CPU, but nowadays -march=core2 is typically fine (also for recent AMDs) and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
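For the slice-and-combine suggestion, the combine step could look like this sketch (my own, based on the standard pairwise-merge formula for (count, mean, M2) accumulators, not code from the answer):

// Merge two partial (count, mean, M2) accumulators for one dimension.
// M2 is the sum of squared deviations, as in the question's var[] array.
struct Partial { double n, mean, M2; };

Partial merge(Partial a, Partial b)
{
    Partial r;
    double delta = b.mean - a.mean;
    r.n    = a.n + b.n;
    r.mean = a.mean + delta * (b.n / r.n);
    r.M2   = a.M2 + b.M2 + delta * delta * (a.n * b.n / r.n);
    return r;
}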
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
Here's what I would do, in guesstimated order of importance:
Return the floating-point value from Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen the impact of cache misses, run the inner loop alternately forwards then backwards. This at least gives you a chance at reusing some cached avg and var values. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
#include <algorithm>  // std::copy, std::copy_n, std::fill_n
#include <cassert>
#include <vector>

Point::FT Point::operator[](int i) const {
    assert(i >= 0 && i < (int)coords.size());
    return coords[i];
}

const Point::FT * Point::constData() const {
    return &coords[0];
}

void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
                       float* var, size_t* split_dims)
{
    assert(points.size() > 0);
    const int D = points[0].dim();
    assert(D > 0);

    // i = 0, i_n = 1
#if __cplusplus >= 201103L
    std::copy_n(points[0].constData(), D, avg);
#else
    std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif

    // i = 1, i_n = 0.5
    if (points.size() >= 2) {
        assert(points[1].dim() == (size_t)D);
        for (int d = D - 1; d >= 0; --d) {
            float const delta = points[1][d] - avg[d];
            avg[d] += delta * 0.5f;
            var[d] = delta * (points[1][d] - avg[d]);
        }
    } else {
        std::fill_n(var, D, 0.0f);
    }

    // i = 2, ...
    for (size_t i = 2; i < points.size(); ) {
        {
            const float i_n = 1.0f / (1.0f + i);
            assert(points[i].dim() == (size_t)D);
            // Forward pass over the dimensions.
            for (int d = 0; d < D; ++d) {
                float const delta = points[i][d] - avg[d];
                avg[d] += delta * i_n;
                var[d] += delta * (points[i][d] - avg[d]);
            }
        }
        ++i;
        if (i >= points.size()) break;
        {
            const float i_n = 1.0f / (1.0f + i);
            assert(points[i].dim() == (size_t)D);
            // Backward pass: the tail of avg/var is still hot in the cache.
            for (int d = D - 1; d >= 0; --d) {
                float const delta = points[i][d] - avg[d];
                avg[d] += delta * i_n;
                var[d] += delta * (points[i][d] - avg[d]);
            }
        }
        ++i;
    }

    /* Find t dimensions with largest scaled variance. */
    kthLargest(var, D, t, split_dims);
}
for (size_t d = 0; d < points[0].dim(); d++) {
    avg[d] = 0.0;
    var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE754 representation of 0.0 in 32 bits is 0x00000000. If the dimension is big, it is worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense to try to reduce the number of divisions: use a partial, non-cumulative average calculation, which would result in Dim division operations for N elements (instead of N x Dim), where N < points.size().
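A hedged sketch of how I read that suggestion (not the answerer's code): accumulate plain sums and divide once per dimension at the end, trading Knuth's numerical stability for fewer divisions.

#include <vector>

// One division per dimension instead of one per sample per dimension.
// Less numerically stable than Knuth/Welford for large N or nearly equal values.
void compute_variances_sums(const std::vector<Point>& points, float* avg, float* var)
{
    const int D = (int)points[0].dim();
    std::vector<double> sum(D, 0.0), sumsq(D, 0.0);
    for (size_t i = 0; i < points.size(); ++i) {
        for (int d = 0; d < D; ++d) {
            const double x = points[i][d];
            sum[d]   += x;
            sumsq[d] += x * x;
        }
    }
    for (int d = 0; d < D; ++d) {
        avg[d] = (float)(sum[d] / points.size());      // one division per dimension
        var[d] = (float)(sumsq[d] - avg[d] * sum[d]);  // sum of squared deviations
    }
}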
Huge speedups could be achieved using CUDA or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).
Another optimization is cache optimization, including both the data cache and the instruction cache (see the linked material on high-level optimization techniques and data cache optimizations). Here is an example of data cache optimization & loop unrolling:
for (size_t d = 0; d < points[0].dim(); d += 4)
{
    // Perform loading all at once.
    register const float p1 = points[i][d + 0];
    register const float p2 = points[i][d + 1];
    register const float p3 = points[i][d + 2];
    register const float p4 = points[i][d + 3];
    register const float delta1 = p1 - avg[d + 0];
    register const float delta2 = p2 - avg[d + 1];
    register const float delta3 = p3 - avg[d + 2];
    register const float delta4 = p4 - avg[d + 3];
    // Perform calculations
    avg[d + 0] += delta1 / n;
    var[d + 0] += delta1 * ((p1) - avg[d + 0]);
    avg[d + 1] += delta2 / n;
    var[d + 1] += delta2 * ((p2) - avg[d + 1]);
    avg[d + 2] += delta3 / n;
    var[d + 2] += delta3 * ((p3) - avg[d + 2]);
    avg[d + 3] += delta4 / n;
    var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors likes data that are very close to each other. Less chance for a data cache miss and a better chance to load all of the data into the cache.
You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to the extraction of the parts, performing the math, then reassembling the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is converted to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
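As a hedged illustration of the power-of-2 idea - a generic 16.16 layout of my own choosing, not one prescribed by this answer:

#include <cstdint>

// 16.16 fixed point: 16 integer bits, 16 fractional bits (base 65536).
typedef int32_t fixed_t;
const int FRAC_BITS = 16;

inline fixed_t to_fixed(float f)   { return (fixed_t)(f * (1 << FRAC_BITS)); }  // truncates
inline float   to_float(fixed_t x) { return (float)x / (1 << FRAC_BITS); }

// Multiplication needs a wider intermediate; the rescale is a shift, not a division.
inline fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> FRAC_BITS);
}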
Summary
Changing the design or implementation to use Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting, and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++) {
    double xi = x[i];
    sum += xi;
    sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n;
double variance = avgsq - avg * avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

Histogram approximation for streaming data

This question is a slight extension of the one answered here. I am working on re-implementing a version of the histogram approximation found in Section 2.1 of this paper, and I would like to get all my ducks in a row before beginning this process again. Last time, I used boost::multi_index, but performance wasn't the greatest, and I would like to avoid the insert/find complexity of a std::set, which is logarithmic in the number of buckets. Because of the number of histograms I'm using (one per feature per class per leaf node of a random tree in a random forest), the computational complexity must be as close to constant as possible.
A standard technique used to implement a histogram involves mapping the input real value to a bin number. To accomplish this, one method is to:
initialize a standard C array of size N, where N = number of bins; and
multiply the input value (real number) by some factor and floor the result to get its index in the C array.
This works well for histograms with uniform bin size, and is quite efficient. However, Section 2.1 of the above-linked paper provides a histogram algorithm without uniform bin sizes.
Another issue is that simply multiplying the input real value by a factor and using the resulting product as an index fails with negative numbers. To resolve this, I considered identifying a '0' bin somewhere in the array. This bin would be centered at 0.0; the bins above/below it could be calculated using the same multiply-and-floor method just explained, with the slight modification that the floored product be added to or subtracted from the '0' bin's index, as necessary.
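Something along these lines is what I have in mind (a hedged sketch; the zero_bin parameter and the rounding choice are my assumptions):

#include <cmath>

// Uniform bins with a central '0' bin so negative values map correctly.
// 'factor' is bins per unit; 'zero_bin' is the index of the bin centered at 0.0.
int bin_index(double x, double factor, int zero_bin)
{
    // floor keeps negative values on the correct side; +0.5 centers bin zero_bin at 0.0
    return zero_bin + (int)std::floor(x * factor + 0.5);
}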
This then raises the question of merges: the algorithm in the paper merges the two closest bins, as measured from center to center. In practice, this creates a 'jagged' histogram approximation, because some bins would have extremely large counts and others would not. Of course, this is due to the non-uniform-sized bins, and doesn't result in any loss of precision. A loss of precision does, however, occur if we try to normalize the non-uniform-sized bins to make them uniform. This is because of the assumption that m/2 samples fall to the left and right of the bin center, where m = bin count. We could model each bin as a Gaussian, but this will still result in a loss of precision (albeit minimal).
So that's where I'm stuck right now, leading to this major question: What's the best way to implement a histogram accepting streaming data and storing each sample in bins of uniform size?
Keep four variables.
int N; // assume for simplicity that N is even
int count[N];
double lower_bound;
double bin_size;
When a new sample x arrives, compute int i = (int)floor((x - lower_bound) / bin_size). If i >= 0 && i < N, then increment count[i]. If i >= N, then repeatedly double bin_size until x - lower_bound < N * bin_size. On every doubling, adjust the counts (optimize this by exploiting sparsity for multiple doublings).
for (int j = 0; j < N / 2; j++) count[j] = count[2 * j] + count[2 * j + 1];
for (int j = N / 2; j < N; j++) count[j] = 0;
The case i < 0 is trickier, since we need to decrease lower_bound as well as increase bin_size (again, optimize for sparsity or adjust the counts in one step).
while (lower_bound > x) {
    lower_bound -= N * bin_size;
    bin_size += bin_size;
    for (int j = N - 1; j > N / 2 - 1; j--)
        count[j] = count[2 * j - N] + count[2 * j - N + 1];
    for (int j = 0; j < N / 2; j++)
        count[j] = 0;
}
The exceptional cases are expensive but happen only a logarithmic number of times (logarithmic in the ratio of your data's range to the initial bin size).
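Putting the pieces together, here is a hedged end-to-end sketch of the insertion routine (my own assembly of the fragments above, untested):

#include <cmath>

const int N = 64;           // number of bins; assumed even
int count[N];               // zero-initialized at namespace scope
double lower_bound = 0.0;   // assumed initial placement
double bin_size = 1.0;      // keep it a power of two, as recommended below

void add_sample(double x)
{
    // Grow to the right: double bin_size until x fits.
    while (x - lower_bound >= N * bin_size) {
        for (int j = 0; j < N / 2; j++) count[j] = count[2 * j] + count[2 * j + 1];
        for (int j = N / 2; j < N; j++) count[j] = 0;
        bin_size += bin_size;
    }
    // Grow to the left: extend lower_bound while doubling bin_size.
    while (lower_bound > x) {
        lower_bound -= N * bin_size;
        bin_size += bin_size;
        for (int j = N - 1; j > N / 2 - 1; j--)
            count[j] = count[2 * j - N] + count[2 * j - N + 1];
        for (int j = 0; j < N / 2; j++)
            count[j] = 0;
    }
    int i = (int)std::floor((x - lower_bound) / bin_size);
    ++count[i];
}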
If you implement this in floating-point, be mindful that floating-point numbers are not real numbers and that statements like lower_bound -= N * bin_size may misbehave (in this case, if N * bin_size is much smaller than lower_bound). I recommend that bin_size be a power of the radix (usually two) at all times.