How to optimize this math operation for speed

How to optimize this math operation for speed - c++

I'm trying to optimize a function taking a good chunk of execution time, which computes the following math operation many times. Is there anyway to make this operation faster?
float total = (sqrt(
((point_A[j].length)*(point_A[j].length))+
((point_B[j].width)*(point_B[j].width))+
((point_C[j].height)*(point_C[j].height))
));

If memory is cheap then you could do the following thereby improving CPU cache hit rate. Since you haven't posted more details, so I will make some assumptions here.
long tmp_len_square[N*3];
for (int j = 0; j < N; ++j) {
tmp_len_square[3 * j] = (point_A[j].length)*(point_A[j].length);
}
for (int j = 0; j < N; ++j) {
tmp_len_square[(3 * j) + 1] = (point_B[j].width)*(point_B[j].width);
}
for (int j = 0; j < N; ++j) {
tmp_len_square[(3 * j) + 2] = (point_C[j].height)*(point_C[j].height);
}
for (int j = 0; j < N; ++j) {
float total = sqrt(tmp_len_square[3 * j] +
tmp_len_square[(3 * j) + 1] +
tmp_len_square[(3 * j) + 2]);
// ...
}

Rearrange the data into this:
float *pointA_length;
float *pointB_width;
float *pointC_height;
That may require some level of butchering of your data structures, so you'll have to choose whether it's worth it or not.
Now what we can do is write this:
void process_points(float* Alengths, float* Bwidths, float* Cheights,
float* output, int n)
{
for (int i = 0; i < n; i++) {
output[i] = sqrt(Alengths[i] * Alengths[i] +
Bwidths[i] * Bwidths[i] +
Cheights[i] * Cheights[i]);
}
}
Writing it like this allows it to be auto-vectorized. For example, GCC targeting AVX and with -fno-math-errno -ftree-vectorize, can vectorize that loop. It does that with a lot of cruft though. __restrict__ and alignment attributes only improve that a little. So here's a hand-vectorized version as well: (not tested)
void process_points(float* Alengths,
float* Bwidths,
float* Cheights,
float* output, int n)
{
for (int i = 0; i < n; i += 8) {
__m256 a = _mm256_load_ps(Alengths + i);
__m256 b = _mm256_load_ps(Bwidths + i);
__m256 c = _mm256_load_ps(Cheights + i);
__m256 asq = _mm256_mul_ps(a, a);
__m256 sum = _mm256_fmadd_ps(c, c, _mm256_fmadd_ps(b, b, asq));
__m256 hsum = _mm256_mul_ps(sum, _mm256_set1_ps(0.5f));
__m256 invsqrt = _mm256_rsqrt_ps(sum);
__m256 s = _mm256_mul_ps(invsqrt, invsqrt);
invsqrt = _mm256_mul_ps(sum, _mm256_fnmadd_ps(hsum, s, _mm256_set1_ps(1.5f)));
_mm256_store_ps(output + i, _mm256_mul_ps(sum, invsqrt));
}
}
This makes a number of assumptions:
all pointers are 32-aligned.
n is a multiple of 8, or at least the buffers have enough padding that they're never accessed out of bounds.
the input buffers are not aliased with the output buffer (they could be aliased among themselves, but .. why)
the slightly reduced accuracy of the square root computed this way is OK (accurate to approximately 22 bits, instead of correctly rounded).
the sum of squares computed with fmadd can be slightly different than if it's computed using multiplies and adds, I assume that's OK too
your target supports AVX/FMA so this will actually run
The method for computing the square root I used here is using an approximate reciprocal square root, an improvement step (y = y * (1.5 - (0.5 * x * y * y))) and then a multiplication by x because x * 1/sqrt(x) = x/sqrt(x) = sqrt(x).

You can eventually try to optimize the sqrt function itself. May I suggest you to have a look at this link:
Best Square Root Method

Your question could be improved by adding a little more context. Is your code required to be portable, or are you targeting a particular compiler, or a specific processor or processor family? Perhaps you're willing to accept a general baseline version with target-specific optimised versions selected at runtime?
Also, there's very little context for the line of code you give. Is it in a tight loop? Or is it scattered in a bunch of places in conditional code in such a loop?
I'm going to assume that it's in a tight loop thus:
for (int j=0; j<total; ++j)
length[j] = sqrt(
(point_A[j].length)*(point_A[j].length) +
(point_B[j].width)*(point_B[j].width) +
(point_C[j].height)*(point_C[j].height));
I'm also going to assume that your target processor is multi-core, and that the arrays are distinct (or that the relevant elements are distinct), then an easy win is to annotate for OpenMP:
#pragma omp parallel for
for (int j=0; j<total; ++j)
length[j] = sqrt((point_A[j].length)*(point_A[j].length) +
(point_B[j].width)*(point_B[j].width) +
(point_C[j].height)*(point_C[j].height));
Compile with g++ -O3 -fopenmp -march=native (or substitute native with your desired target processor architecture).
If you know your target, you might be able to benefit from parallelisation of loops with the gcc flag -ftree-parallelize-loops=n - look in the manual.
Now measure your performance change (I'm assuming that you measured the original, given that this is an optimisation question). If it's still not fast enough for you, then it's time to consider changing your data structures, algorithms, or individual lines of code.

Related

C++ performance optimization for linear combination of large matrices?

I have a large tensor of floating point data with the dimensions 35k(rows) x 45(cols) x 150(slices) which I have stored in an armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.
I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.
Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;
int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);
double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {
Timer::time_point start = Timer::now();
arma::fmat result(rows, cols, arma::fill::zeros);
for (int i = 0; i < slices; i++)
result += tensor.slice(i) * w(i);
Timer::time_point end = Timer::now();
Duration span = end - start;
double t = span.count();
overallTime += t;
cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}
cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.

First at all I need to admit, I'm not familiar with the arma framework or the memory layout; the least if the syntax result += slice(i) * weight evaluates lazily.
Two primary problem and its solution anyway lies in the memory layout and the memory-to-arithmetic computation ratio.
To say a+=b*c is problematic because it needs to read the b and a, write a and uses up to two arithmetic operations (two, if the architecture does not combine multiplication and accumulation).
If the memory layout is of form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels and should be expressed as such.
If it's float tensor[c][h][w], it's better to unroll the loop to result+= slice(i) + slice(i+1)+.... Reading four slices at a time reduces the memory transfers by 50%.
It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices) where N<16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
There's a possibility of a minor improvement by padding the slice count to multiples of 4 or 8, by compiling with -ffast-math to enable fused multiply accumulate (if available) and with multithreading.
The constraints indicate the need to perform 13.5GFlops, which is a reasonable number in terms of arithmetic (for many modern architectures) but also it means at least 54 Gb/s memory bandwidth, which could be relaxed with fp16 or 16-bit fixed point arithmetic.
EDIT
Knowing the memory order to be float tensor[150][45][35000] or float tensor[kSlices][kRows * kCols == kCols * kRows] suggests to me to try first unrolling the outer loop by 4 (or maybe even 5, as 150 is not divisible by 4 requiring special case for the excess) streams.
void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
// ensure that the cols*rows is a multiple of 4 (pad if necessary)
// - allows the auto vectorizer to skip handling the 'excess' code where the data
// length mod simd width != 0
// one could try even SIMD width of 16*4, as clang 14
// can further unroll the inner loop to 4 ymm registers
auto const stride = (kCols * kRows + 3) & ~3;
// try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
for (int s = 0; s < 150; s+=5) {
auto src0 = tensor + s * stride;
auto src1 = src0 + stride;
auto src2 = src1 + stride;
auto src3 = src2 + stride;
auto src4 = src3 + stride;
auto dst = result;
for (int x = 0; x < stride; x++) {
// clang should be able to optimize caching the weights
// to registers outside the innerloop
auto add = src0[x] * w[s] +
src1[x] * w[s+1] +
src2[x] * w[s+2] +
src3[x] * w[s+3] +
src4[x] * w[s+4];
// clang should be able to optimize this comparison
// out of the loop, generating two inner kernels
if (s == 0) {
dst[x] = add;
} else {
dst[x] += add;
}
}
}
}
EDIT 2
Another starting point (before adding multithreading) would be consider changing the layout to
float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional
The downside now is that kSlices = 150 can't anymore fit all the weights in registers (and secondly kSlices is not a multiple of 4 or 8). Furthermore the final reduction needs to be horizontal.
The upside is that reduction no longer needs to go through memory, which is a big thing with the added multithreading.
void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
// each thread will read from 4 positions in order
// to share the weights -- finding the best distance
// might need some iterations
auto src0 = tensor;
auto src1 = src0 + c;
auto src2 = src1 + c;
auto src3 = src2 + c;
for (int i = 0; i < n/4; i++) {
vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
// #pragma unroll?
for (auto j = 0; j < c / 8; c++) {
vec8 w(w + j);
acc0 += w * vec8(src0 + j);
acc1 += w * vec8(src1 + j);
acc2 += w * vec8(src2 + j);
acc3 += w * vec8(src3 + j);
}
vec4 sum = horizontal_reduct(acc0,acc1,acc2,acc3);
sum.store(dst); dst+=4;
}
}
These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to do compile using vec4 = float __attribute__ __attribute__((vector_size(16))); to efficient SIMD code.

As #hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.
int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1) + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
i += 4;
}
while (i < slices) {
result += tensor.slice(i) * w_id(i);
i++;
}

Without having the actual code, I'm guessing that
+= tensor.slice(i) * w_id(i)
creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function
addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )
which translates to pure loops over the elements:
for (i=....)
for (j=...)
lhs[i][j] += slice1[i][j]*w1[j] + slice2[i][j] &c
It would surprise me if that doesn't buy you an extra factor.

how to further optimize this code using Openmp multithreading

i have this code snippet I came across and I'm trying to use OpenMP to make it run faster than the original version. However, in comparison this seems to be taking about the same amount of time as the older version. Not sure why this multithreading approach is not working to optimize it. Like the timings are still the same. What can I do to make it run even faster?:
void sobel(unsigned char *data_out,
unsigned char *data_in, unsigned height,
unsigned width)
{
/* Sobel matrices for convolution */
int sobelv[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
int sobelh[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
unsigned int size, i, j;
int lay;
size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(64) shared (data_in,data_out,sobelv, sobelh,size) private (i,j,lay)
#endif
for (lay = 0; lay < 3; lay++) {
for (i = 1; i < height - 1; ++i) {
for (j = 1; j < width - 1; j++) {
int sumh, sumv;
int k = -1, l = -1;
sumh = 0;
sumv = 0;
/* Convolution part */
for ( k = -1; k < 2; k++)
for (l = -1; l < 2; l++) {
sumh =
sumh + sobelh[k + 1][l + 1] *(int) data_in[lay * size + (i + k) * width +(j + l)];
sumv =
sumv + sobelv[k + 1][l +1] * (int) data_in[lay *size +(i +k) *width + (j +l)];
}
int temp = abs(sumh / 8) + abs(sumv / 8);
data_out[lay * size + i * width + j] =
(temp > 255? 255: temp);
}
}
}
}
the main function is simply calling this function like this:
sobel(data_out, data_in, header.height, header.width);
any help would be appreciated!! :)

The best optimization you can apply is to vectorize the code. Compilers can often auto-vectorize the code when it is sufficiently simple but this one is too complex for most compilers (including GCC and Clang) to vectorize it.
Manual code vectorization is cumbersome error-prone and often make the code (more) dependent of a specific architecture (eg. x86-64). However, you can help the compiler to generate it for you. To do that, you it better to:
avoid mixing signed/unsigned types and type of different size;
use the smallest possible types fitting your needs;
avoid loops and conditions in the vectorized loop;
access data contiguously;
avoid integer multiplication/division with small types (on x86-64 and/or with some compilers);
prefer using local short-scoped variables when this is possible;
enable advanced optimizations like -O3 for GCC/Clang, possibly coupled with -mavx2 if your target platform supports the AVX-2 instruction set, or with -march=native if your target platform is the one where the program is built;
be careful about aliasing (possibly using temporary arrays, strict aliasing rules, memcpy calls, restrict compiler extensions, etc.) [thanks to #Laci].
You can check the generated assembly code to see if the code is vectorized or not.
Moreover, using collapse(2) should enough here to get a good speed-up. collapse(3) can introduce some unwanted overheads due to the last loop being shared amongst threads. collapse(64) is not correct (it cannot be bigger than the number of nested loops).
Here is the resulting untested code:
#include <cmath>
void sobel(unsigned char *data_out,
unsigned char *data_in, int height,
int width)
{
const int size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(2) shared(data_in,data_out,size)
#endif
for (int lay = 0; lay < 3; lay++)
{
for (int i = 1; i < height - 1; ++i)
{
for (int j = 1; j < width - 1; j++)
{
short a11 = data_in[lay * size + (i-1) * width + (j-1)];
short a12 = data_in[lay * size + (i-1) * width + j];
short a13 = data_in[lay * size + (i-1) * width + (j+1)];
short a21 = data_in[lay * size + i * width + (j-1)];
short a23 = data_in[lay * size + i * width + (j+1)];
short a31 = data_in[lay * size + (i+1) * width + (j-1)];
short a32 = data_in[lay * size + (i+1) * width + j];
short a33 = data_in[lay * size + (i+1) * width + (j+1)];
short sumh = a13 - a11 + (a23 - a21) + (a23 - a21) + a33 - a31;
short sumv = a31 + a32 + a32 + a33 - (a11 + a12 + a12 + a13);
short temp = (abs(sumh) >> 3) + (abs(sumv) >> 3);
data_out[lay * size + i * width + j] = (temp > 255? 255: temp);
}
}
}
}
I expect the code to be several time faster (especially true in sequential) -- typically about 10 times faster with AVX-2 since the processor can work on 16 values at once (despite a bit more work related to SIMD instructions).
Another possible optimization you can do is called register blocking. The idea is to change the loop so that you work on small fixed-size tiles (eg. 2x2 or 4x2 SIMD values). This should reduces the number of L1-cache loads and the number of char-to-short/short-to-char conversions to perform. However, this is hard to help the compiler so it does this optimization correctly on such a code. It is probably better to use SIMD intrinsics if performance is critical and do the register blocking yourself.

Is Eigen library matrix/vector manipulation faster than .net ones if the matrix is dense and unsymmetrical?

I have some matrix operations, mostly dealing with operations like running over all the each of the rows and columns of the matrix and perform multiplication a*mat[i,j]*mat[ii,j]:
public double[] MaxSumFunction()
{
var maxSum= new double[vector.GetLength(1)];
for (int j = 0; j < matrix.GetLength(1); j++)
{
for (int i = 0; i < matrix.GetLength(0); i++)
{
for (int ii = 0; ii < matrix.GetLength(0); ii++)
{
double wi= Math.Sqrt(vector[i]);
double wii= Math.Sqrt(vector[ii]);
maxSum[j] += SomePowerFunctions(wi, wii) * matrix[i, j]*matrix[ii, j];
}
}
}
}
private double SomePowerFunctions(double wi, double wj)
{
var betaij = wi/ wj;
var numerator = 8 * Math.Sqrt(wi* wj) * Math.Pow(betaij, 3.0 / 2)
* (wi+ betaij * wj);
var dominator = Math.Pow(1 - betaij * betaij, 2) +
4 * wi* wj* betaij * (1 + Math.Pow(betaij, 2)) +
4 * (wi* wi+ wj* wj) * Math.Pow(betaij, 2);
if (wi== 0 && wj== 0)
{
if (Math.Abs(betaij - 1) < 1.0e-8)
return 1;
else
return 0;
}
return numerator / dominator;
}
I found such loops to be particularly slow if the matrix size is big.
I want the speed to be fast. So I am thinking about re-implementing these algorithms using the Eigen library.
My matrix is not symmetrical, not sparse and contains no regularity that any solver can exploit reliably.
I read that Eigen solver can be fast because of:
Compiler optimization
Vectorization
Multi-thread support
But I wonder those advantages are really applicable given my matrix characteristics?
Note: I could have just run a sample or two to find out, but I believe that asking the question here and have it documented on the Internet is going to help others as well.

Before thinking about low level optimizations, look at your code and observe that many quantities are recomputed many time. For instance, f(wi,wii) does not depend on j, so they could either be precomputed once (see below) or you can rewrite your loop to make the loop on j the nested one. Then the nested loop will simply be a coefficient wise product between a constant scalar and two columns of your matrix (I don't .net and assume j is indexing columns). If the storage if column-major, then this operation should be fully vectorized by your compiler (again, I don't know .net, but any C++ compiler will do, and if you Eigen, it will be vectorized explicitly). This should be enough to get a huge performance boost.
Depending on the sizes of matrix, you might also try to leverage optimized matrix-matrix implementation by precomputed f(wi,wii) into a MatrixXd F; (using Eigen's language), and then observe that the whole computation amount to:
VectorXd v = your_vector;
MatrixXd F = MatrixXd::nullaryExpr(n,n,[&](Index i,Index j) {
return SomePowerFunctions(sqrt(v(i)), sqrt(v(j)));
});
MatrixXd M = your_matrix;
MatrixXd FM = F * M;
VectorXd maxSum = (M.array() * FM.array()).colwise().sum();

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Points is a class I have implemented) and computes the variance of them, in each dimension (with regard to the algorithm of Knuth).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would really happy to pay the trade-off between speed and approximate computation of variances. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Notice, that before edit, I had used -o3, not with a capital 'o'. Thanks to ypnos, I compiled now with the optimization flag -O3. I am sure that there was a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
copute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: point[i] is just the i-th element of std::vector, where the second [], is implemented as this, in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in the compute_variances() is for sure something heavy! However, Knuth's algorithm was a numerical stable one and I was not able to find another algorithm, that would de both numerical stable and without division.
Note that I am not interesting in parallelism right now.
[EDIT.2]
Minimal example of Point class (I think I didn't forget to show something):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* #param i - index of the coordinate
* #return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* #param i - index of the coordinate
* #return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};

1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped-up by doing the computation for four elements at once with SSE. As your code really only processes one single element in each loop iteration, there is no bottleneck. If you go down to 16bit short instead of 32bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the solution to your performance problem, but just one of them that can also be combined with others.
2. Micro-parallelizm
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The option to the latter depends on your CPU, but nowadays typically -march core2 is fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).

The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.

Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen impact of cache misses, run the inner loop alternatively forwards then backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}

for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE754 representation of 0.0 in 32bits is 0x00000000. If the dimension is big, it worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense, to try to reduce the number of divisions: use partial non-cumulative average calculation, that would result in Dim division operation for N elements (instead of N x Dim); N < points.size()
Huge speedup could be achieved, using Cuda or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).

Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place the avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors like datums that are very close to each other. Less chance for data cache miss and better chance to load all of the data into the cache.

You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to the extraction of the parts, performing the math, then reassemblying the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is convert to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to used Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.

Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

Different results between Debug and Release

I have the problem that my code returns different results when comparing debug to release. I checked that both modes use /fp:precise, so that should not be the problem. The main issue I have with this is that the complete image analysis (its an image understanding project) is completely deterministic, there's absolutely nothing random in it.
Another issue with this is the fact that my release build actually always returns the same result (23.014 for the image), while debug returns some random value between 22 and 23, which just should not be. I've already checked whether it may be thread related, but the only part in the algorithm which is multi-threaded returns the precisely same result for both debug and release.
What else may be happening here?
Update1: The code I now found responsible for this behaviour:
float PatternMatcher::GetSADFloatRel(float* sample, float* compared, int sampleX, int compX, int offX)
{
if (sampleX != compX)
{
return 50000.0f;
}
float result = 0;
float* pTemp1 = sample;
float* pTemp2 = compared + offX;
float w1 = 0.0f;
float w2 = 0.0f;
float w3 = 0.0f;
for(int j = 0; j < sampleX; j ++)
{
w1 += pTemp1[j] * pTemp1[j];
w2 += pTemp1[j] * pTemp2[j];
w3 += pTemp2[j] * pTemp2[j];
}
float a = w2 / w3;
result = w3 * a * a - 2 * w2 * a + w1;
return result / sampleX;
}
Update2:
This is not reproducible with 32bit code. While debug and release code will always result in the same value for 32bit, it still is different from the 64bit release version, and the 64bit debug still returns some completely random values.
Update3:
Okay, I found it to certainly be caused by OpenMP. When I disable it, it works fine. (both Debug and Release use the same code, and both have OpenMP activated).
Following is the code giving me trouble:
#pragma omp parallel for shared(last, bestHit, cVal, rad, veneOffset)
for(int r = 0; r < 53; ++r)
{
for(int k = 0; k < 3; ++k)
{
for(int c = 0; c < 30; ++c)
{
for(int o = -1; o <= 1; ++o)
{
/*
r: 2.0f - 15.0f, in 53 steps, representing the radius of blood vessel
c: 0-29, in steps of 1, representing the absorption value (collagene)
iO: 0-2, depending on current radius. Signifies a subpixel offset (-1/3, 0, 1/3)
o: since we are not sure we hit the middle, move -1 to 1 pixels along the samples
*/
int offset = r * 3 * 61 * 30 + k * 30 * 61 + c * 61 + o + (61 - (4*w+1))/2;
if(offset < 0 || offset == fSamples.size())
{
continue;
}
last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
if(bestHit > last)
{
bestHit = last;
rad = (r+8)*0.25f;
cVal = c * 2;
veneOffset =(-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
if(fabs(veneOffset) < 0.001)
veneOffset = 0.0f;
}
last = GetSADFloatRel(input, &fSamples.at(offset), w * 4 + 1, w * 4 + 1, 0);
if(bestHit > last)
{
bestHit = last;
rad = (r+8)*0.25f;
cVal = c * 2;
veneOffset = (-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
if(fabs(veneOffset) < 0.001)
veneOffset = 0.0f;
}
}
}
}
}
Note: with Release mode and OpenMP activated I get the same result as with deactivating OpenMP. Debug mode and OpenMP activated gets a different result, OpenMP deactivated gets the same result as with Release.

At least two possibilities:
Turning on optimization may result in the compiler reordering operations. This can introduce small differences in floating-point calculations when compared to the order executed in debug mode, where operation reordering does not occur. This may account for numerical differences between debug and release, but does not account for numerical differences from one run to the next in debug mode.
You have a memory-related bug in your code, such as reading/writing past the bounds of an array, using an uninitialized variable, using an unallocated pointer, etc. Try running it through a memory checker, such as the excellent Valgrind, to identify such problems. Memory related errors may account for non-deterministic behavior.
If you are on Windows, then Valgrind isn't available (pity), but you can look here for a list of alternatives.

To elaborate on my comment, this is the code that is most probably the root of your problem:
#pragma omp parallel for shared(last, bestHit, cVal, rad, veneOffset)
{
...
last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
if(bestHit > last)
{
last is only assigned to before it is read again so it is a good candidate for being a lastprivate variable, if you really need the value from the last iteration outside the parallel region. Otherwise just make it private.
Access to bestHit, cVal, rad, and veneOffset should be synchronised by a critical region:
#pragma omp critical
if (bestHit > last)
{
bestHit = last;
rad = (r+8)*0.25f;
cVal = c * 2;
veneOffset =(-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
if(fabs(veneOffset) < 0.001)
veneOffset = 0.0f;
}
Note that by default all variables, except the counters of parallel for loops and those defined inside the parallel region, are shared, i.e. the shared clause in your case does nothing unless you also apply the default(none) clause.
Another thing that you should be aware of is that in 32-bit mode Visual Studio uses x87 FPU math while in 64-bit mode it uses SSE math by default. x87 FPU does intermediate calculations using 80-bit floating point precision (even for calculations involving float only) while the SSE unit supports only the standard IEEE single and double precisions. Introducing OpenMP or any other parallelisation technique to a 32-bit x87 FPU code means that at certain points intermediate values should be converted back to the single precision of float and if done sufficiently many times a slight or significant difference (depending on the numerical stability of the algorithm) could be observed between the results from the serial code and the parallel one.
Based on your code, I would suggest that the following modified code would give you good parallel performance because there is no synchronisation at each iteration:
#pragma omp parallel private(last)
{
int rBest = 0, kBest = 0, cBest = 0;
float myBestHit = bestHit;
#pragma omp for
for(int r = 0; r < 53; ++r)
{
for(int k = 0; k < 3; ++k)
{
for(int c = 0; c < 30; ++c)
{
for(int o = -1; o <= 1; ++o)
{
/*
r: 2.0f - 15.0f, in 53 steps, representing the radius of blood vessel
c: 0-29, in steps of 1, representing the absorption value (collagene)
iO: 0-2, depending on current radius. Signifies a subpixel offset (-1/3, 0, 1/3)
o: since we are not sure we hit the middle, move -1 to 1 pixels along the samples
*/
int offset = r * 3 * 61 * 30 + k * 30 * 61 + c * 61 + o + (61 - (4*w+1))/2;
if(offset < 0 || offset == fSamples.size())
{
continue;
}
last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
if(myBestHit > last)
{
myBestHit = last;
rBest = r;
cBest = c;
kBest = k;
}
last = GetSADFloatRel(input, &fSamples.at(offset), w * 4 + 1, w * 4 + 1, 0);
if(myBestHit > last)
{
myBestHit = last;
rBest = r;
cBest = c;
kBest = k;
}
}
}
}
}
#pragma omp critical
if (bestHit > myBestHit)
{
bestHit = myBestHit;
rad = (rBest+8)*0.25f;
cVal = cBest * 2;
veneOffset =(-0.5f + (1.0f / 3.0f) * kBest + (1.0f / 3.0f) / 2.0f);
if(fabs(veneOffset) < 0.001)
veneOffset = 0.0f;
}
}
It only stores the values of the parameters that give the best hit in each thread and then at the end of the parallel region it computes rad, cVal and veneOffset based on the best values. Now there is only one critical region, and it is at the end of code. You can get around it also, but you would have to introduce additional arrays.

One thing to double check is that all variables are initialized. Many times un-optimized code (Debug mode) will initialize memory.

I would have said variable initialization in debug vs not there in release. But your results would not back this up (reliable result in release).
Does your code rely on any specific offsets or sizes? Debug build would place guards bytes around some allocations.
Could it be floating point related?
The debug floating point stack is different to the release which is built for more efficiency.
Look here: http://thetweaker.wordpress.com/2009/08/28/debugrelease-numerical-differences/

Just about any undefined behavior can account for this: uninitialized
variables, rogue pointers, multiple modifications of the same object
without an intervening sequence point, etc. etc. The fact that the
results are at times unreproduceable argues somewhat for an
uninitialized variable, but it can also occur from pointer problems or
bounds errors.
Be aware that optimization can change results, especially on an Intel.
Optimization can change which intermediate values spill to memory, and
if you've not carefully used parentheses, even the order of evaluation
in an expression. (And as we all know, in machine floating point, (a +
b) + c) != a + (b + c).) Still the results should be deterministic:
you will get different results according to the degree of optimization,
but for any set of optimization flags, you should get the same results.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js