Non-maximum suppression in Canny's algorithm: optimize with SSE

Non-maximum suppression in Canny's algorithm: optimize with SSE - c++

I have the "honor" to improve the runtime of the following code of someone else. (it's a non-maximum supression from the canny - algorithm"). My first thought was to use SSE-intrinsic code, i'm very new in this area, so my question is.
Is there any chance to do this?
And if so, can someone give me a few hints?
void vNonMaximumSupression(
float* fpDst,
float const*const fpMagnitude,
unsigned char const*const ucpGradient, ///< [in] 0 -> 0°, 1 -> 45°, 2 -> 90°, 3 -> 135°
int iXCount,
int iXOffset,
int iYCount,
int ignoreX,
int ignoreY)
{
memset(fpDst, 0, sizeof(fpDst[0]) * iXCount * iXOffset);
for (int y = ignoreY; y < iYCount - ignoreY; ++y)
{
for (int x = ignoreX; x < iXCount - ignoreX; ++x)
{
int idx = iXOffset * y + x;
unsigned char dir = ucpGradient[idx];
float fMag = fpMagnitude[idx];
if (dir == 0 && fpMagnitude[idx - 1] < fMag && fMag > fpMagnitude[idx + 1] ||
dir == 1 && fpMagnitude[idx - iXCount + 1] < fMag && fMag > fpMagnitude[idx + iXCount - 1] ||
dir == 2 && fpMagnitude[idx - iXCount] < fMag && fMag > fpMagnitude[idx + iXCount] ||
dir == 3 && fpMagnitude[idx - iXCount - 1] < fMag && fMag > fpMagnitude[idx + iXCount + 1]
)
fpDst[idx] = fMag;
else
fpDst[idx] = 0;
}
}
}

Discussion
As #harold noted, the main problem for vectorization here is that the algorithm uses different offset for each pixel (specified by direction matrix). I can think of several potential ways of vectorization:
#nikie: evaluate all branches at once, i.e. compare each pixel with all of its neighbors. The results of these comparisons are blended according to the direction values.
#PeterCordes: load a lot of pixels into SSE registers, then use _mm_shuffle_epi8 to choose only the neighbors in the given direction. Then perform two vectorized comparisons.
(me): use scalar instructions to load the proper two neighboring pixels along the direction. Then combine these values for four pixels into SSE registers. Finally, do two comparisons in SSE.
The second approach is very hard to implement efficiently, because for a pack of 4 pixels, there are 18 neighboring pixels to choose from. That would require too much shuffles, I think.
The first approach looks nice, but it would perform four times more operations per pixel. I suppose the speedup of vector instructions would be overwhelmed by too much computations done.
I suggest using the third approach. Below you can see hints on improving performance.
Hybrid approach
First of all, we want to make scalar code as fast as possible. The code presented by you contains too much branches. Most of them are not predictable, for instance the switch by direction.
In order to remove branches, I suggest creating an array delta = {1, stride - 1, stride, stride + 1}, which gives index offset from direction. By using this array, you can find indices of neighboring pixels to compare with (without branches). Then you do two comparisons. Finally, you can write a ternary operator like res = (isMax ? curr : 0);, hoping that compiler can generate branchless code for it.
Unfortunately, compiler (at least MSVC2013) is not smart enough to avoid branch by isMax. That's why we can benefit from rewriting the inner cycle with scalar SSE intrinsics. Look up the guide for reference. You need mostly intrinsics ending with _ss, since the code is completely scalar.
Finally, we can vectorize everything except for loading neighboring pixels. In order to load neighboring pixels, we can use _mm_setr_ps intrinsics with scalar arguments, asking the compiler to generate some good code for us =)
__m128 forw = _mm_setr_ps(src[idx+0 + offset0], src[idx+1 + offset1], src[idx+2 + offset2], src[idx+3 + offset3]);
__m128 back = _mm_setr_ps(src[idx+0 - offset0], src[idx+1 - offset1], src[idx+2 - offset2], src[idx+3 - offset3]);
I have just implemented it myself. Tested in single thread on Ivy Bridge 3.4Ghz. A random image of 1024 x 1024 resolution was used as a source. The results (in milliseconds) are:
original: 13.078 //your code
branchless: 8.556 //'branchless' code
scalarsse: 2.151 //after rewriting to sse intrinsics
hybrid: 1.159 //partially vectorized code
They confirm performance improvements on each step. The final code needs a bit more than one millisecond to process a one-megapixel image. The total speedup is about 11.3 times. Indeed, you can get better performance on GPU =)
I hope the presented information would be enough for you to reproduce the steps. If you are seeking terrible spoilers, look here for my implementations of all these stages.

Related

How to optimize the following fragment of C++ code - zero crossings in a volume

I am struggling to optimize the following fragment of code. The function is being called for every voxel in a 320x320x320 volume where each voxel is a 16bit grayscale value. The volume is stored as a series of planes (cross sections) and each plane is a contiguous 1D array, hence for example the position of a voxel below current voxel becomes currentPosition + pixelsPerRow and position to the left of it becomes currentPosition - 1.
The function checks for both zero crossings in the volume and if the absolute value of current and neighboring voxel is above certain threshold. This is a neccessary part of Marr-Hildreth edge detector.
currentPosition is the current voxel and relativePosition can be either the current voxel too (in that case zero crossings are checked in 8 directions around it in the same plane) or it can be voxel directly above or directly below it. This way, for every voxel, 27 checks are performed which covers all the possible directions in 3D.
Perhaps it is possible to rearrange the function in such the way that its execution will be faster. I already tried to arrange the order of checks in such the way that branch prediction has a slightly better chance to kick in, but maybe it's possible to speed it up even more. For now it takes 50% processing time of a much larger application so it begs for some optimizations.
bool zeroCrossing(int16_t* currentPosition, int16_t* relativePosition, int pixelsPerRow, int threshold)
{
return *currentPosition * *(relativePosition - pixelsPerRow - 1) < 0 && abs(*currentPosition + *(relativePosition - pixelsPerRow - 1)) > threshold
|| *currentPosition * *(relativePosition - pixelsPerRow) < 0 && abs(*currentPosition + *(relativePosition - pixelsPerRow)) > threshold
|| *currentPosition * *(relativePosition - pixelsPerRow + 1) < 0 && abs(*currentPosition + *(relativePosition - pixelsPerRow + 1)) > threshold
|| *currentPosition * *(relativePosition - 1) < 0 && abs(*currentPosition + *(relativePosition - 1)) > threshold
|| *currentPosition * *(relativePosition) < 0 && abs(*currentPosition + *(relativePosition)) > threshold
|| *currentPosition * *(relativePosition + 1) < 0 && abs(*currentPosition + *(relativePosition + 1)) > threshold
|| *currentPosition * *(relativePosition + pixelsPerRow - 1) < 0 && abs(*currentPosition + *(relativePosition + pixelsPerRow - 1)) > threshold
|| *currentPosition * *(relativePosition + pixelsPerRow) < 0 && abs(*currentPosition + *(relativePosition + pixelsPerRow)) > threshold
|| *currentPosition * *(relativePosition + pixelsPerRow + 1) < 0 && abs(*currentPosition + *(relativePosition + pixelsPerRow + 1)) > threshold;
}

My gut instinct is that this code lends itself well to parallelization. Either use AVX(2), or offload this to a GPU. That would take it outside the realm of C++, but that is a reasonable thing to do for the core function of a program.
I'm assuming you already use threads to parallelize the operation, because that is pretty trivial. Note that with AVX you'd still need threads; each CPU core has it's own AVX unit.

If there is a zero crossing, it will happen in between two neighboring pixels. When running through the volume, you need to check each pair of neighbors only once. If you apply your function to each pixel, you'll check each pair twice. (I use the term pixel for the elements of a 3D image too, I don't like the term voxel a whole lot).
Also, you check pairs of pixels that share an edge or a vertex, you only need to check those that share a face. If there is a zero crossing between pixels a and d in the figure below, there must be one between a and b or between a and c.
a b
c d
Thus, for each pixel, you only need to check three neighbors, not 27. This would reduce your execution time to 1/9.
However, this does not exactly account for your magnitude check, the difference between a and d could be larger than that between either of its other two neighbors. However, I don't think this is important.
On that note, your magnitude check is wrong: if a and b differ in sign and are both very large (an important zero crossing), then abs(a + b) could be 0, and you wouldn't count it. You probably want to take the difference!

Faster Method for Multiple Bilinear Interpolation?

I am writing a program in C++ to reconstruct a 3D object from a set of projected 2D images, the most computation-intensive part of which involves magnifying and shifting each image via bilinear interpolation. I currently have a pair of functions for this task; "blnSetup" defines a handful of parameters outside the loop, then "bilinear" applies the interpolation point-by-point within the loop:
(NOTE: 'I' is a 1D array containing ordered rows of image data)
//Pre-definition structure (in header)
struct blnData{
float* X;
float* Y;
int* I;
float X0;
float Y0;
float delX;
float delY;
};
//Pre-definition function (outside the FOR loop)
extern inline blnData blnSetup(float* X, float* Y, int* I)
{
blnData bln;
//Create pointers to X, Y, and I vectors
bln.X = X;
bln.Y = Y;
bln.I = I;
//Store offset and step values for X and Y
bln.X0 = X[0];
bln.delX = X[1] - X[0];
bln.Y0 = Y[0];
bln.delY = Y[1] - Y[0];
return bln;
}
//Main interpolation function (inside the FOR loop)
extern inline float bilinear(float x, float y, blnData bln)
{
float Ixy;
//Return -1 if the target point is outside the image matrix
if (x < bln.X[0] || x > bln.X[-1] || y < bln.Y[0] || y > bln.Y[-1])
Ixy = 0;
//Otherwise, apply bilinear interpolation
else
{
//Define known image width
int W = 200;
//Find nearest indices for interpolation
int i = floor((x - bln.X0) / bln.delX);
int j = floor((y - bln.Y0) / bln.delY);
//Interpolate I at (xi, yj)
Ixy = 1 / ((bln.X[i + 1] - bln.X[i])*(bln.Y[j + 1] - bln.Y[j])) *
(
bln.I[W*j + i] * (bln.X[i + 1] - x) * (bln.Y[j + 1] - y) +
bln.I[W*j + i + 1] * (x - bln.X[i]) * (bln.Y[j + 1] - y) +
bln.I[W*(j + 1) + i] * (bln.X[i + 1] - x) * (y - bln.Y[j]) +
bln.I[W*(j + 1) + i + 1] * (x - bln.X[i]) * (y - bln.Y[j])
);
}
return Ixy;
}
EDIT: The function calls are below. 'flat.imgdata' is a std::vector containing the input image data and 'proj.imgdata' is a std::vector containing the transformed image.
int Xs = flat.dim[0];
int Ys = flat.dim[1];
int* Iarr = flat.imgdata.data();
float II, x, y;
bln = blnSetup(X, Y, Iarr);
for (int j = 0; j < flat.imgdata.size(); j++)
{
x = 1.2*X[j % Xs];
y = 1.2*Y[j / Xs];
II = bilinear(x, y, bln);
proj.imgdata[j] = (int)II;
}
Since I started optimizing, I have been able to reduce computation time by ~50x (!) by switching from std::vectors to C arrays within the interpolation function, and another 2x or so by cleaning up redundant computations/typecasting/etc, but assuming O(n) with n being the total number of processed pixels, the full reconstruction (~7e10 pixels) should still take 40min or so--about an order of magnitude longer than my goal of <5min.
According to Visual Studio's performance profiler, the interpolation function call ("II = bilinear(x, y, bln);") is unsurprisingly still the majority of my computation load. I haven't been able to find any linear algebraic methods for fast multiple interpolation, so my question is: is this basically as fast as my code will get, short of applying more or faster CPUs to the task? Or is there a different approach that might speed things up?
P.S. I've also only been coding in C++ for about a month now, so feel free to point out any beginner mistakes I might be making.

I wrote up a long answer suggesting looking at OpenCV (opencv.org), or using Halide (http://halide-lang.org/), and getting into how image warping is optimized, but I think a shorter answer might serve better. If you are really just scaling and translating entire images, OpenCV has code to do that and we have an example for resizing in Halide as well (https://github.com/halide/Halide/blob/master/apps/resize/resize.cpp).
If you really have an algorithm that needs to index an image using floating-point coordinates which result from a computation that cannot be turned into a moderately simple function on integer coordinates, then you really want to be using filtered texture sampling on a GPU. Most techniques for optimizing on the CPU rely on exploiting some regular pattern of access in the algorithm and removing float to integer conversion from the addressing. (For resizing, one uses two integer variables, one which indexes the pixel coordinate of the image and the other which is the fractional part of the coordinate and it indexes a kernel of weights.) If this is not possible, the speedups are somewhat limited on CPUs. OpenCV does provide fairly general remapping support, but it likely isn't all that fast.
Two optimizations that may be applicable here are trying to move the boundary condition out the loop and using a two pass approach in which the horizontal and vertical dimensions are processed separately. The latter may or may not win and will require tiling the data to fit in cache if the images are very large. Tiling in general is pretty important for large images, but it isn't clear it is the first order performance problem here and depending on the values in the inputs, the cache behavior may not be regular enough anyway.

"vector 50x slower than array". That's a dead giveaway you're in debug mode, where vector::operator[] is not inlined. You will probably get the necessary speedup, and a lot more, simply by switching to release mode.
As a bonus, vector has a .back() method, so you have a proper replacement for that [-1]. Pointers to the begin of an array don't contain the array size, so you can't find the back of an array that way.

Fast way to convert ARGB texture to next power of 2 texture

We have some old devices that don't support non-pot textures and we have a function that converts ARGB textures to next power of 2 texture. The problem is that it's quite slow and we're wondering if there is a better approach to convert these textures.
void PotTexture()
{
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
std::vector<unsigned char> pottedImageData;
pottedImageData.resize(u2 * v2 * 4);
size_t y, x, c;
for (y = 0; y < imageData.height; y++)
{
for (x = 0; x < imageData.width; x++)
{
for (c = 0; c < 4; c++)
{
pottedImageData[4 * u2 * y + 4 * x + c] = imageData.convertedData[4 * imageData.width * y + 4 * x + c];
}
}
}
imageData.width = u2;
imageData.height = v2;
std::swap(imageData.convertedData, pottedImageData);
}
On some devices this can easily use 100% of the CPU so any optimizations would be amazing. Are there any existing functions that I could look at that perform this conversion?
Edit:
I've optimized the above loop slightly to:
for (y = 0; y < imageData.height; y++)
{
memcpy(
&(pottedImageData[y * u2 * 4]),
&(imageData.convertedData[y * imageData.width * 4]),
imageData.width * 4);
}

Even devices that don't support NPOT texture should support NPOT load.
Create the texture as an exact power of 2 and NO CONTENT using glTexImage2D, passing a null pointer for data.
data may be a null pointer. In this case, texture memory is allocated to accommodate a texture of width width and height height. You can then download subtextures to initialize this texture memory. The image is undefined if the user tries to apply an uninitialized portion of the texture image to a primitive.
Then use glTexSubImage2D to upload a NPOT image, which occupies only a portion of the total texture. This can be done without any CPU-side image rearrangement.

Having had a similar problem in a program I wrote, I took a very different approach. Rather than stretch the source texture, I just copied it into the top left corner of an otherwise empty power-of-two texture.
Then in the pixel shader you use a pair of floats to adjust s,t values so you fetch from just the top left corner.
float sAdjust = static_cast<float>(textureWidth) / static_cast<float>(containerWidth)
float tAdjust = static_cast<float>(textureHeight) / static_cast<float>(containerHeight)
is how you compute them, and to use them you'll get a Vec2 holding the s,t coordinates, just multiply s by sAdjust and t by tAdjust before using them to fetch. If you're using Direct3D, it'd be something akin to this:
D3DXVECTOR4 stAdjust;
stAdjust.x = sAdjust;
stAdjust.y = tAdjust;
// Transfer stAdjust into a float4 inside your pixel shader, call it stAdjust in there
Now in the pixel shader assume you have:
float2 texcoord;
float4 stAdjust;
you just say:
texcoord.x = texcoord.x * stAdjust.x;
texcoord.y = texcoord.y * stAdjust.y;
before using texcoord. Sorry I can't tell you how to do this in GLSL, but you get the general idea.

Okay, the very first optimization can be done here:
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
What you want to do is (for each dimension) find the floor of the logarithm-base2 (log2) and put that into 2**n+1. The standard math library has function log2 but it operates on floating point. But we can use is. 2**n can be written as 1 << n. So this gives
size_t const dim_p2_… = 1 << (int)floor(log2(dim_…)+1);
Better but not yet ideal, because of that float conversion. The Bit Twiddling hacks document has a few functions for integer ilog2: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
But we're still not optimal. Let me introduce you to Compiler intrinsics, which translate into machine instructions, if the machine in question can do it on the metal.
GNU GCC: int __builtin_ffs (unsigned int x), which returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.
MSVC++: _BitScanReverse, which returns the length of the run of the most significant bits set zero. So _BitScanReverse is like builtin_ffs - 1 (there's also a builtin_clz which behaves exactly like BitScanReverse.
So we can do
#define ilog2_p1(x) (__builtin_ffs(x))
or
#define ilog2_p1(x) (__BitScanReverse(x)+1)
and use that.
size_t const dim_p2_… = 1 << (int)floor(ilog2_p1(dim_…));
While we're at bit twiddling: We can save that whole ordeal if a texture is already in power of two format. A few years ago I (independently) rediscovered the wonderfully portable bit twiddling trick, exploiting the properties of complement-2 integers. You can also find it in the bit twiddles document. But the type neutral, concise macro form is rarely seen. So here it is:
#define ISPOW2(x) ( (x) && !( (x) & ((x) - 1) ) )
You're using C++ so templates are in order:
template<typename T> bool ispow2(T const x) { return x && !( x & (x - 1) ); }
Then Ben Voight already told you, how to use glTexSubImage2D to load that into the texture. Also have a look at the GL_ARB_texture_rectangle extension, that allows to load NPOT textures, but without the ability for mipmapping and advanced filtering. But it might be a viable choice for you.
If you ever feel the need to scale the texture it's always worth looking into dual spaces. In this case the spatial frequency domain dual space. Upscaling a signal is essentially a impulse response. As such it can be described as a convolution. Convolutions usually are O(n²) in complexity. But due to the Fourier Convolution theorem in Fourier space the equivalent is simple multiplication, so it becomes O(n). FFT can be done with O(n log n), so the total complexity is about O(n + 2n log n), which is much better.

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Points is a class I have implemented) and computes the variance of them, in each dimension (with regard to the algorithm of Knuth).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would really happy to pay the trade-off between speed and approximate computation of variances. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Notice, that before edit, I had used -o3, not with a capital 'o'. Thanks to ypnos, I compiled now with the optimization flag -O3. I am sure that there was a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
copute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: point[i] is just the i-th element of std::vector, where the second [], is implemented as this, in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in the compute_variances() is for sure something heavy! However, Knuth's algorithm was a numerical stable one and I was not able to find another algorithm, that would de both numerical stable and without division.
Note that I am not interesting in parallelism right now.
[EDIT.2]
Minimal example of Point class (I think I didn't forget to show something):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* #param i - index of the coordinate
* #return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* #param i - index of the coordinate
* #return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};

1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped-up by doing the computation for four elements at once with SSE. As your code really only processes one single element in each loop iteration, there is no bottleneck. If you go down to 16bit short instead of 32bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the solution to your performance problem, but just one of them that can also be combined with others.
2. Micro-parallelizm
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The option to the latter depends on your CPU, but nowadays typically -march core2 is fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).

The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.

Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen impact of cache misses, run the inner loop alternatively forwards then backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}

for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE754 representation of 0.0 in 32bits is 0x00000000. If the dimension is big, it worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense, to try to reduce the number of divisions: use partial non-cumulative average calculation, that would result in Dim division operation for N elements (instead of N x Dim); N < points.size()
Huge speedup could be achieved, using Cuda or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).

Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place the avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors like datums that are very close to each other. Less chance for data cache miss and better chance to load all of the data into the cache.

You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to the extraction of the parts, performing the math, then reassemblying the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is convert to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to used Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.

Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

Can my loop be optimized any more?

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?
I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.
Any help is appreciated!
Edit: I'm running in Windows 7 64-bit with Visual Studio 2008 edition on a Core 2 Duo T5850 (2.16 GHz)
typedef double real;
struct Particle
{
Vector pos, vel, acc, jerk;
Vector oldPos, oldVel, oldAcc, oldJerk;
real mass;
};
class Vector
{
private:
real vec[3];
public:
// Operators defined here
};
real Gravity::interact(Particle *p, size_t numParticles)
{
PROFILE_FUNC();
real tau_q = 1e300;
for (size_t i = 0; i < numParticles; i++)
{
p[i].jerk = 0;
p[i].acc = 0;
}
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
Vector r = p[j].pos - p[i].pos;
Vector v = p[j].vel - p[i].vel;
real r2 = lengthsq(r);
real v2 = lengthsq(v);
// Calculate inverse of |r|^3
real r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
Vector da = r * r3i;
Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
real mij = p[i].mass + p[j].mass;
real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
real tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}
return sqrt(sqrt(tau_q));
}

Inline the calls to lengthsq().
Change pow(r2,-1.5) to 1/(r2*sqrt(r2)) to lower the cost of the computing r^1.5
Use scalars (p_i_acc, etc.) inside the innner most loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily.
4a. Try replacing the if (...) tau_q = with
tau_q=minimum(...,...)
Many compilers recognize the mininum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes.
4b. [EDIT to split 4a and 4b apart] You might consider storing tau_..q2 instead as tau_q, and comparing against r2/v2 rather than r2*r2/v2*v2. Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau..q2 at the end. To do this, collect minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop]
[EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.
Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it], and c) going multicore (using OpenMP).

This line real r3i = Constants::G * pow(r2, -1.5); is going to hurt. Any kind of sqrt lookup or platform specific help with a square root would help.
If you have simd abilities, breaking up your vector subtracts and squares into its own loop and computing them all at once will help a bit. Same for your mass/jerk calcs.
Something that comes to mind is - are you keeping enough precision with your calc? Taking things to the 4th power and 4th root really thrash your available bits through the under/overflow blender. I'd be sure that your answer is indeed your answer when complete.
Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.
Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius squared check to fight the O(n^2) behavior of your loop. If you elimiated 1/2 your particles, it would run around x4 faster.
One last thing. You could thread your inner loop to multiple processors. You'd have to make a seperate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally your mass/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)

Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.
You might consider whether you can use floats rather than doubles.
If you're using gcc then make sure you're using -O2 or possibly -O3.
You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).
Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).

If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.

Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.
If you move the mass out too, you could store things like this:
struct ParticleData
{
Vector pos, vel, acc, jerk;
};
ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...
then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g
struct ParticleData
{
// data_x[0] == pos.x, data_x[1] = vel.x, data_x[2] = acc.x, data_x[3] = jerk.x
Vector4 data_x;
// data_y[0] == pos.y, data_y[1] = vel.y, etc.
Vector4 data_y;
// data_z[0] == pos.z, data_y[1] = vel.z, etc.
Vector4 data_z;
};
where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)
Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).

My first advice would be to look at the molecular dynamics litterature, people in this field have considered a lot of optimizations in the field of particle systems. Have a look at GROMACS for example.
With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is the use of neighbour lists with buffer regions to update those lists only when needed.

All good stuff above. I've been doing similar things to a 2nd order (Leapfrog) integrator. The next two things I did after considering many of the improvements suggested above was start using SSE intrinsics to take advantage of vectorization and parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.
SSE example:
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp
Novel cache algorithm, explanation and example code:
http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp
You might also find the following deck I gave at Seattle Code Camp interesting:
http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/
Your forth order integrator is more complex and would be harder to parallelize with limited gains on a two core system but I would definitely suggest checking out SSE, I got some reasonable performance improvements here.

Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?
What's real? Can it be a float?
Apart from that you'll have to turn to MMX/SSE/assembly optimisations.

Would you benefit from the famous "fast inverse square root" algorithm?
float InvSqrt(float x)
{
union {
float f;
int i;
} tmp;
tmp.f = x;
tmp.i = 0x5f3759df - (tmp.i >> 1);
float y = tmp.f;
return y * (1.5f - 0.5f * x * y * y);
}
It returns a reasonably accurate representation of 1/r**2 (the first iteration of Newton's method with a clever initial guess). It is used widely for computer graphics and game development.

Consider also pulling your multiplication of Constants::G out of the loop. If you can change the semantic meaning of the vectors stored so that they effectively store the actual value/G you can do the gravitation constant multiplacation as needed.
Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.
Consider splitting our particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could be an efficient memset if you did this. You would then essentially have two arrays (or vectors) where part particle 'n' is stored at index 'n' of each of the arrays.

Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.
Now then, always always apply algorithm optimizations first and only when no faster algorithm is available should you go piecemeal optimization by assembly. And then, do inner loops first.
You may want to profile to see if this is really the bottleneck first.

Thing I look for is branching, they tend to be performance killers.
You can use loop unrolling.
also, remember multiple with smaller parts of the problem :-
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
is about the same as having one loop doing everything, and you can get speed ups through loop unrolling and better hitting of the cache
You could thread this to make better use of multiple cores
you have some expensive calculations that you might be able to reduce, especially if the calcs end up calculating the same thing, can use caching etc....
but really need to know where its costing you the most

You should re-use the reals and vectors that you always use. The cost of constructing a Vector or Real might be trivial.. but not if numParticles is very large, especially with your seemingly O((n^2)/2) loop.
Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
r = p[j].pos - p[i].pos;
v = p[j].vel - p[i].vel;
r2 = lengthsq(r);
v2 = lengthsq(v);
// Calculate inverse of |r|^3
r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
da = r * r3i;
dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
mij = p[i].mass + p[j].mass;
tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}

You can replace any occurrence of:
a = b/c
d = e/f
with
icf = 1/(c*f)
a = bf*icf
d = ec*icf
if you know that icf isn't going to cause anything to go out of range and if your hardware can perform 3 multiplications faster than a division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
You'll get away with fewer time steps if you use other integration schemes (eg. Runge-Kutta) but I suspect you already know that.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js