Optimizing compute shaders

Optimizing compute shaders - c++

I have been doing a lot of different computations in compute shaders in OpenGL for the last couple of months. Some work fine, others are slow, some I could optimize somewhat, others again I could not optimize whatsoever.
I have been playing around with the simple code below (gravitational forces between n particles), just to find some strategies on how to increase performance in general, but absolutely nothing works:
#version 450 core
uniform uint NumParticles;
layout (std430, binding = 0) buffer bla
{
double rIn[];
};
layout (std430, binding = 1) writeonly buffer bla2
{
double aOut[];
};
layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
void main()
{
int n;
double dist3, dist2;
dvec3 a, diff, r = dvec3(rIn[gl_GlobalInvocationID.x * 3 + 0], rIn[gl_GlobalInvocationID.x * 3 + 1], rIn[gl_GlobalInvocationID.x * 3 + 2]);
a.x = a.y = a.z = 0;
for (n = 0; n < NumParticles; n++)
{
if (n != gl_GlobalInvocationID.x)
{
diff = dvec3(rIn[n * 3 + 0], rIn[n * 3 + 1], rIn[n * 3 + 2]) - r;
dist2 = dot(diff, diff);
dist3 = 1.0 / (sqrt(dist2) * dist2);
a += diff * dist3;
}
}
aOut[gl_GlobalInvocationID.x * 3 + 0] = a.x;
aOut[gl_GlobalInvocationID.x * 3 + 1] = a.y;
aOut[gl_GlobalInvocationID.x * 3 + 2] = a.z;
}
I have the strong suspicion that it is a lot of memory access that slows down this code. So one thing I tried was making a shared variable as a "buffer" and let the first thread (gl_LocalInvocationID.x == 0) read the first (for example) 1024 particles, let all threads do their calculations, then the next 1024, ect. This slowed the code down by a factor of 2-3. Another thing I tried, was putting the particle-coordinates in a uniform array (which only works for up to 1024 particles and I use a lot more - so this was just to see, if it made a difference), which changed absolutely nothing.
I can provide some code for the above examples, but I don't think, this would be helpful.
I know there are minor improvements one could make (like using inversesqrt instead of 1.0 / sqrt, not computing particle n <-> particle m when m <-> n is already computed...), but I would be interested in a general approach for compute shaders.
So can anybody give me any hints for how I could improve performance for this code? I couldn't really find anything online on how to improve performance of compute shaders, so any general advice (not necessarily just for this code) would be appreciated.

This operation as defined doesn't seem like a good one for GPU parallelism. It's very hungry in terms of memory accesses, as complete processing for one particle requires reading the data for every other particle in the system.
If you want to keep the algorithm as is, you can implement it more optimally. As it stands, each work item does all of the processing for a particular particle all at once. That's a huge number of memory operations happening all at once.
Instead, split your particles into blocks, sized for a work group. Each work group operates on a block of source particles and a block of test particles (which may be the same block). The test particles should be loaded into shared memory, so each work group can repeatedly read test data quickly. So a single work group only does a portion of the tests for each block of source particles.
The big difficulty now is writing the data. Since multiple work groups are potentially be writing the added forces to the same source particles, you need to use some mechanism to either atomically increment the source particle data or write the data to a temporary memory buffer. A second compute shader process can run over the temporary buffers and combine the data in a reduction process.

Related

Faster Method for Multiple Bilinear Interpolation?

I am writing a program in C++ to reconstruct a 3D object from a set of projected 2D images, the most computation-intensive part of which involves magnifying and shifting each image via bilinear interpolation. I currently have a pair of functions for this task; "blnSetup" defines a handful of parameters outside the loop, then "bilinear" applies the interpolation point-by-point within the loop:
(NOTE: 'I' is a 1D array containing ordered rows of image data)
//Pre-definition structure (in header)
struct blnData{
float* X;
float* Y;
int* I;
float X0;
float Y0;
float delX;
float delY;
};
//Pre-definition function (outside the FOR loop)
extern inline blnData blnSetup(float* X, float* Y, int* I)
{
blnData bln;
//Create pointers to X, Y, and I vectors
bln.X = X;
bln.Y = Y;
bln.I = I;
//Store offset and step values for X and Y
bln.X0 = X[0];
bln.delX = X[1] - X[0];
bln.Y0 = Y[0];
bln.delY = Y[1] - Y[0];
return bln;
}
//Main interpolation function (inside the FOR loop)
extern inline float bilinear(float x, float y, blnData bln)
{
float Ixy;
//Return -1 if the target point is outside the image matrix
if (x < bln.X[0] || x > bln.X[-1] || y < bln.Y[0] || y > bln.Y[-1])
Ixy = 0;
//Otherwise, apply bilinear interpolation
else
{
//Define known image width
int W = 200;
//Find nearest indices for interpolation
int i = floor((x - bln.X0) / bln.delX);
int j = floor((y - bln.Y0) / bln.delY);
//Interpolate I at (xi, yj)
Ixy = 1 / ((bln.X[i + 1] - bln.X[i])*(bln.Y[j + 1] - bln.Y[j])) *
(
bln.I[W*j + i] * (bln.X[i + 1] - x) * (bln.Y[j + 1] - y) +
bln.I[W*j + i + 1] * (x - bln.X[i]) * (bln.Y[j + 1] - y) +
bln.I[W*(j + 1) + i] * (bln.X[i + 1] - x) * (y - bln.Y[j]) +
bln.I[W*(j + 1) + i + 1] * (x - bln.X[i]) * (y - bln.Y[j])
);
}
return Ixy;
}
EDIT: The function calls are below. 'flat.imgdata' is a std::vector containing the input image data and 'proj.imgdata' is a std::vector containing the transformed image.
int Xs = flat.dim[0];
int Ys = flat.dim[1];
int* Iarr = flat.imgdata.data();
float II, x, y;
bln = blnSetup(X, Y, Iarr);
for (int j = 0; j < flat.imgdata.size(); j++)
{
x = 1.2*X[j % Xs];
y = 1.2*Y[j / Xs];
II = bilinear(x, y, bln);
proj.imgdata[j] = (int)II;
}
Since I started optimizing, I have been able to reduce computation time by ~50x (!) by switching from std::vectors to C arrays within the interpolation function, and another 2x or so by cleaning up redundant computations/typecasting/etc, but assuming O(n) with n being the total number of processed pixels, the full reconstruction (~7e10 pixels) should still take 40min or so--about an order of magnitude longer than my goal of <5min.
According to Visual Studio's performance profiler, the interpolation function call ("II = bilinear(x, y, bln);") is unsurprisingly still the majority of my computation load. I haven't been able to find any linear algebraic methods for fast multiple interpolation, so my question is: is this basically as fast as my code will get, short of applying more or faster CPUs to the task? Or is there a different approach that might speed things up?
P.S. I've also only been coding in C++ for about a month now, so feel free to point out any beginner mistakes I might be making.

I wrote up a long answer suggesting looking at OpenCV (opencv.org), or using Halide (http://halide-lang.org/), and getting into how image warping is optimized, but I think a shorter answer might serve better. If you are really just scaling and translating entire images, OpenCV has code to do that and we have an example for resizing in Halide as well (https://github.com/halide/Halide/blob/master/apps/resize/resize.cpp).
If you really have an algorithm that needs to index an image using floating-point coordinates which result from a computation that cannot be turned into a moderately simple function on integer coordinates, then you really want to be using filtered texture sampling on a GPU. Most techniques for optimizing on the CPU rely on exploiting some regular pattern of access in the algorithm and removing float to integer conversion from the addressing. (For resizing, one uses two integer variables, one which indexes the pixel coordinate of the image and the other which is the fractional part of the coordinate and it indexes a kernel of weights.) If this is not possible, the speedups are somewhat limited on CPUs. OpenCV does provide fairly general remapping support, but it likely isn't all that fast.
Two optimizations that may be applicable here are trying to move the boundary condition out the loop and using a two pass approach in which the horizontal and vertical dimensions are processed separately. The latter may or may not win and will require tiling the data to fit in cache if the images are very large. Tiling in general is pretty important for large images, but it isn't clear it is the first order performance problem here and depending on the values in the inputs, the cache behavior may not be regular enough anyway.

"vector 50x slower than array". That's a dead giveaway you're in debug mode, where vector::operator[] is not inlined. You will probably get the necessary speedup, and a lot more, simply by switching to release mode.
As a bonus, vector has a .back() method, so you have a proper replacement for that [-1]. Pointers to the begin of an array don't contain the array size, so you can't find the back of an array that way.

Optimize mathematical expressions

I am frustrated with how much time the fitch software take to do some simple computations. I profiled it with Intel VTune and it seems that 52% of the CPU time is spent in the nudists() function:
void nudists(node *x, node *y)
{
/* compute distance between an interior node and tips */
long nq=0, nr=0, nx=0, ny=0;
double dil=0, djl=0, wil=0, wjl=0, vi=0, vj=0;
node *qprime, *rprime;
qprime = x->next;
rprime = qprime->next->back;
qprime = qprime->back;
ny = y->index;
dil = qprime->d[ny - 1];
djl = rprime->d[ny - 1];
wil = qprime->w[ny - 1];
wjl = rprime->w[ny - 1];
vi = qprime->v;
vj = rprime->v;
x->w[ny - 1] = wil + wjl;
if (wil + wjl <= 0.0)
x->d[ny - 1] = 0.0;
else
x->d[ny - 1] = ((dil - vi) * wil + (djl - vj) * wjl) / (wil + wjl);
nx = x->index;
nq = qprime->index;
nr = rprime->index;
dil = y->d[nq - 1];
djl = y->d[nr - 1];
wil = y->w[nq - 1];
wjl = y->w[nr - 1];
y->w[nx - 1] = wil + wjl;
if (wil + wjl <= 0.0)
y->d[nx - 1] = 0.0;
else
y->d[nx - 1] = ((dil - vi) * wil + (djl - vj) * wjl) / (wil + wjl);
} /* nudists */
The two long lines are responsible for 24% of the total CPU time. Is there any way to optimize this code and especially the two long lines? Another function which consumes a lot of CPU time is this:
void secondtraverse(node *q, double y, long *nx, double *sum)
{
/* from each of those places go back to all others */
/* nx comes from firsttraverse */
/* sum comes from evaluate via firsttraverse */
double z=0.0, TEMP=0.0;
z = y + q->v;
if (q->tip) {
TEMP = q->d[(*nx) - 1] - z;
*sum += q->w[(*nx) - 1] * (TEMP * TEMP);
} else {
secondtraverse(q->next->back, z, nx, sum);
secondtraverse(q->next->next->back, z, nx,sum);
}
} /* secondtraverse */
The code which calculates the sum is responsible for 18% of the CPU time. Any way to make it run faster?
The complete source code can be found here: http://evolution.genetics.washington.edu/phylip/getme.html

As far as optimizing the big equation lines, you are using some of the most time consuming operations: multiplication and division.
You will have to look for optimizations in a bigger frame, picture or scope. Some ideas:
Fixed Point arithmetic
Eliminating the division for each iteration.
Threading
Mulitple cores
Array, not linked list
Fixed Point Arithmetic
If you can make your numeric base a power of 2, many of your divisions will change into bit shifts. For example, dividing by 16 is the same as right shifting 4 times. Shifts are usually faster than divisions.
Eliminating division per iteration
Rather than performing the division on each iteration, extract it out and perform it less often, perhaps using different values.
If you treat the division as a fraction, you can play with the numerator many times before dividing by the denominator.
Threading
You may want to consider multiple threads. Create threads based on code efficiency. Let one thread be a worker thread that calculates in the background.
Multiple Cores (parallel execution)
The 'x' and 'y' variables appear to be independent of each other. These calculations could be set up for parallel programming. One core or processor performs the 'x' calculation while another core is calculating the 'y' variable.
Think about splitting this at a higher level. One core (thread) processes all the 'x' variables while another core processes the 'y' variables. The results saved independently. Let the main core process all the results after all the 'x' and 'y' variables have been calculated.
Arrays, not lists
Your processor will be happiest when all its data can fit into the processor's data cache. If it can't fit all the data, then fit as much as possible. Thus arrays will have the best chance of fitting into a data cache line than a linked list. The processor will know that an array address sequential and may not have to reload the data cache.

Fast way to convert ARGB texture to next power of 2 texture

We have some old devices that don't support non-pot textures and we have a function that converts ARGB textures to next power of 2 texture. The problem is that it's quite slow and we're wondering if there is a better approach to convert these textures.
void PotTexture()
{
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
std::vector<unsigned char> pottedImageData;
pottedImageData.resize(u2 * v2 * 4);
size_t y, x, c;
for (y = 0; y < imageData.height; y++)
{
for (x = 0; x < imageData.width; x++)
{
for (c = 0; c < 4; c++)
{
pottedImageData[4 * u2 * y + 4 * x + c] = imageData.convertedData[4 * imageData.width * y + 4 * x + c];
}
}
}
imageData.width = u2;
imageData.height = v2;
std::swap(imageData.convertedData, pottedImageData);
}
On some devices this can easily use 100% of the CPU so any optimizations would be amazing. Are there any existing functions that I could look at that perform this conversion?
Edit:
I've optimized the above loop slightly to:
for (y = 0; y < imageData.height; y++)
{
memcpy(
&(pottedImageData[y * u2 * 4]),
&(imageData.convertedData[y * imageData.width * 4]),
imageData.width * 4);
}

Even devices that don't support NPOT texture should support NPOT load.
Create the texture as an exact power of 2 and NO CONTENT using glTexImage2D, passing a null pointer for data.
data may be a null pointer. In this case, texture memory is allocated to accommodate a texture of width width and height height. You can then download subtextures to initialize this texture memory. The image is undefined if the user tries to apply an uninitialized portion of the texture image to a primitive.
Then use glTexSubImage2D to upload a NPOT image, which occupies only a portion of the total texture. This can be done without any CPU-side image rearrangement.

Having had a similar problem in a program I wrote, I took a very different approach. Rather than stretch the source texture, I just copied it into the top left corner of an otherwise empty power-of-two texture.
Then in the pixel shader you use a pair of floats to adjust s,t values so you fetch from just the top left corner.
float sAdjust = static_cast<float>(textureWidth) / static_cast<float>(containerWidth)
float tAdjust = static_cast<float>(textureHeight) / static_cast<float>(containerHeight)
is how you compute them, and to use them you'll get a Vec2 holding the s,t coordinates, just multiply s by sAdjust and t by tAdjust before using them to fetch. If you're using Direct3D, it'd be something akin to this:
D3DXVECTOR4 stAdjust;
stAdjust.x = sAdjust;
stAdjust.y = tAdjust;
// Transfer stAdjust into a float4 inside your pixel shader, call it stAdjust in there
Now in the pixel shader assume you have:
float2 texcoord;
float4 stAdjust;
you just say:
texcoord.x = texcoord.x * stAdjust.x;
texcoord.y = texcoord.y * stAdjust.y;
before using texcoord. Sorry I can't tell you how to do this in GLSL, but you get the general idea.

Okay, the very first optimization can be done here:
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
What you want to do is (for each dimension) find the floor of the logarithm-base2 (log2) and put that into 2**n+1. The standard math library has function log2 but it operates on floating point. But we can use is. 2**n can be written as 1 << n. So this gives
size_t const dim_p2_… = 1 << (int)floor(log2(dim_…)+1);
Better but not yet ideal, because of that float conversion. The Bit Twiddling hacks document has a few functions for integer ilog2: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
But we're still not optimal. Let me introduce you to Compiler intrinsics, which translate into machine instructions, if the machine in question can do it on the metal.
GNU GCC: int __builtin_ffs (unsigned int x), which returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.
MSVC++: _BitScanReverse, which returns the length of the run of the most significant bits set zero. So _BitScanReverse is like builtin_ffs - 1 (there's also a builtin_clz which behaves exactly like BitScanReverse.
So we can do
#define ilog2_p1(x) (__builtin_ffs(x))
or
#define ilog2_p1(x) (__BitScanReverse(x)+1)
and use that.
size_t const dim_p2_… = 1 << (int)floor(ilog2_p1(dim_…));
While we're at bit twiddling: We can save that whole ordeal if a texture is already in power of two format. A few years ago I (independently) rediscovered the wonderfully portable bit twiddling trick, exploiting the properties of complement-2 integers. You can also find it in the bit twiddles document. But the type neutral, concise macro form is rarely seen. So here it is:
#define ISPOW2(x) ( (x) && !( (x) & ((x) - 1) ) )
You're using C++ so templates are in order:
template<typename T> bool ispow2(T const x) { return x && !( x & (x - 1) ); }
Then Ben Voight already told you, how to use glTexSubImage2D to load that into the texture. Also have a look at the GL_ARB_texture_rectangle extension, that allows to load NPOT textures, but without the ability for mipmapping and advanced filtering. But it might be a viable choice for you.
If you ever feel the need to scale the texture it's always worth looking into dual spaces. In this case the spatial frequency domain dual space. Upscaling a signal is essentially a impulse response. As such it can be described as a convolution. Convolutions usually are O(n²) in complexity. But due to the Fourier Convolution theorem in Fourier space the equivalent is simple multiplication, so it becomes O(n). FFT can be done with O(n log n), so the total complexity is about O(n + 2n log n), which is much better.

Gram matrix using VexCL

I have a pretty large data (does not fit into a GPU memory) containing many vectors where each vector is several MBs.
I'd like to calculate, using multiple GPU devices, the Gram matrix using a gaussian kernel.
In other words, for every pair of vectors x,y, I need to calculate the norm of x-y. So if I have N vectors, I have (N^2+N)/2 such pairs. I don't care about saving space or time by taking advantage of the symmetry, it can do the whole N^2.
How can I do it with VexCL? I know its the only library supporting multiple GPUs, and I did pretty much doing it effectively with plain OpenCL with no success so far.
Please note that the dataset won't even fit the machine's RAM, I'm reading blocks of vectors from a memory mapped file.
Thanks much!!

You will obviously need to split your vectors into groups of, say, m, load the groups one by one (or, rather, two by two) onto your GPUs and do the computations. Here is a complete program that does the computation (as I understood it) for the two currently loaded chunks:
#include <vexcl/vexcl.hpp>
int main() {
const size_t n = 1024; // Each vector size.
const size_t m = 4; // Number of vectors in a chunk.
vex::Context ctx( vex::Filter::Count(1) );
// The input vectors...
vex::vector<double> chunk1(ctx, m * n);
vex::vector<double> chunk2(ctx, m * n);
// ... with some data.
chunk1 = vex::element_index();
chunk2 = vex::element_index();
vex::vector<double> gram(ctx, m * m); // The current chunk of Gram matrix to fill.
/*
* chunk1 and chunk2 both have dimensions [m][n].
* We want to take each of chunk2 m rows, subtract those from each of
* chunk1 rows, and reduce the result along the dimension n.
*
* In order to do this, we create two virtual 3D matrices (x and y below,
* those are just expressions and are never instantiated) sized [m][m][n],
* where
*
* x[i][j][k] = chunk1[i][k] for each j, and
* y[i][j][k] = chunk2[j][k] for each i.
*
* Then what we need to compute is
*
* gram[i][j] = sum_k( (x[i][j][k] - y[i][j][k])^2 );
*
* Here it goes:
*/
using vex::extents;
auto x = vex::reshape(chunk1, extents[m][m][n], extents[0][2]);
auto y = vex::reshape(chunk2, extents[m][m][n], extents[1][2]);
// The single OpenCL kernel is generated and launched here:
gram = vex::reduce<vex::SUM>(
extents[m][m][n], // The dimensions of the expression to reduce.
pow(x - y, 2.0), // The expression to reduce.
2 // The dimension to reduce along.
);
// Copy the result to host, spread it across your complete gram matrix.
// I am lazy though, so let's just dump it to std::cout:
std::cout << gram << std::endl;
}
I suggest you load chunk1 once, then in sequence load all chunk2 variants and do the computations, then load next chunk1, etc. etc. Note that slicing, reshaping, and multidimensional reduction operations are only supported for a context with a single compute device in it. So what is left is how to spread the computations across all of your compute devices. The easiest way to do this is probably to create single VexCL context that would grab all available GPUs, and then create vectors of command queues out of it:
vex::Context ctx( vex::Filter::Any );
std::vector<std::vector<vex::command_queue>> q;
for(size_t d = 0; d < ctx.size(); ++d)
q.push_back({ctx.queue(d)});
//...
// In a std::thread perhaps:
chunk1(q[d], m * n);
chunk2(q[d], m * n);
// ...
I hope this is enough to get you started.

Can my loop be optimized any more?

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?
I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.
Any help is appreciated!
Edit: I'm running in Windows 7 64-bit with Visual Studio 2008 edition on a Core 2 Duo T5850 (2.16 GHz)
typedef double real;
struct Particle
{
Vector pos, vel, acc, jerk;
Vector oldPos, oldVel, oldAcc, oldJerk;
real mass;
};
class Vector
{
private:
real vec[3];
public:
// Operators defined here
};
real Gravity::interact(Particle *p, size_t numParticles)
{
PROFILE_FUNC();
real tau_q = 1e300;
for (size_t i = 0; i < numParticles; i++)
{
p[i].jerk = 0;
p[i].acc = 0;
}
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
Vector r = p[j].pos - p[i].pos;
Vector v = p[j].vel - p[i].vel;
real r2 = lengthsq(r);
real v2 = lengthsq(v);
// Calculate inverse of |r|^3
real r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
Vector da = r * r3i;
Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
real mij = p[i].mass + p[j].mass;
real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
real tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}
return sqrt(sqrt(tau_q));
}

Inline the calls to lengthsq().
Change pow(r2,-1.5) to 1/(r2*sqrt(r2)) to lower the cost of the computing r^1.5
Use scalars (p_i_acc, etc.) inside the innner most loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily.
4a. Try replacing the if (...) tau_q = with
tau_q=minimum(...,...)
Many compilers recognize the mininum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes.
4b. [EDIT to split 4a and 4b apart] You might consider storing tau_..q2 instead as tau_q, and comparing against r2/v2 rather than r2*r2/v2*v2. Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau..q2 at the end. To do this, collect minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop]
[EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.
Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it], and c) going multicore (using OpenMP).

This line real r3i = Constants::G * pow(r2, -1.5); is going to hurt. Any kind of sqrt lookup or platform specific help with a square root would help.
If you have simd abilities, breaking up your vector subtracts and squares into its own loop and computing them all at once will help a bit. Same for your mass/jerk calcs.
Something that comes to mind is - are you keeping enough precision with your calc? Taking things to the 4th power and 4th root really thrash your available bits through the under/overflow blender. I'd be sure that your answer is indeed your answer when complete.
Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.
Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius squared check to fight the O(n^2) behavior of your loop. If you elimiated 1/2 your particles, it would run around x4 faster.
One last thing. You could thread your inner loop to multiple processors. You'd have to make a seperate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally your mass/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)

Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.
You might consider whether you can use floats rather than doubles.
If you're using gcc then make sure you're using -O2 or possibly -O3.
You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).
Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).

If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.

Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.
If you move the mass out too, you could store things like this:
struct ParticleData
{
Vector pos, vel, acc, jerk;
};
ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...
then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g
struct ParticleData
{
// data_x[0] == pos.x, data_x[1] = vel.x, data_x[2] = acc.x, data_x[3] = jerk.x
Vector4 data_x;
// data_y[0] == pos.y, data_y[1] = vel.y, etc.
Vector4 data_y;
// data_z[0] == pos.z, data_y[1] = vel.z, etc.
Vector4 data_z;
};
where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)
Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).

My first advice would be to look at the molecular dynamics litterature, people in this field have considered a lot of optimizations in the field of particle systems. Have a look at GROMACS for example.
With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is the use of neighbour lists with buffer regions to update those lists only when needed.

All good stuff above. I've been doing similar things to a 2nd order (Leapfrog) integrator. The next two things I did after considering many of the improvements suggested above was start using SSE intrinsics to take advantage of vectorization and parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.
SSE example:
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp
Novel cache algorithm, explanation and example code:
http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp
You might also find the following deck I gave at Seattle Code Camp interesting:
http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/
Your forth order integrator is more complex and would be harder to parallelize with limited gains on a two core system but I would definitely suggest checking out SSE, I got some reasonable performance improvements here.

Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?
What's real? Can it be a float?
Apart from that you'll have to turn to MMX/SSE/assembly optimisations.

Would you benefit from the famous "fast inverse square root" algorithm?
float InvSqrt(float x)
{
union {
float f;
int i;
} tmp;
tmp.f = x;
tmp.i = 0x5f3759df - (tmp.i >> 1);
float y = tmp.f;
return y * (1.5f - 0.5f * x * y * y);
}
It returns a reasonably accurate representation of 1/r**2 (the first iteration of Newton's method with a clever initial guess). It is used widely for computer graphics and game development.

Consider also pulling your multiplication of Constants::G out of the loop. If you can change the semantic meaning of the vectors stored so that they effectively store the actual value/G you can do the gravitation constant multiplacation as needed.
Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.
Consider splitting our particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could be an efficient memset if you did this. You would then essentially have two arrays (or vectors) where part particle 'n' is stored at index 'n' of each of the arrays.

Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.
Now then, always always apply algorithm optimizations first and only when no faster algorithm is available should you go piecemeal optimization by assembly. And then, do inner loops first.
You may want to profile to see if this is really the bottleneck first.

Thing I look for is branching, they tend to be performance killers.
You can use loop unrolling.
also, remember multiple with smaller parts of the problem :-
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
is about the same as having one loop doing everything, and you can get speed ups through loop unrolling and better hitting of the cache
You could thread this to make better use of multiple cores
you have some expensive calculations that you might be able to reduce, especially if the calcs end up calculating the same thing, can use caching etc....
but really need to know where its costing you the most

You should re-use the reals and vectors that you always use. The cost of constructing a Vector or Real might be trivial.. but not if numParticles is very large, especially with your seemingly O((n^2)/2) loop.
Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
r = p[j].pos - p[i].pos;
v = p[j].vel - p[i].vel;
r2 = lengthsq(r);
v2 = lengthsq(v);
// Calculate inverse of |r|^3
r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
da = r * r3i;
dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
mij = p[i].mass + p[j].mass;
tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}

You can replace any occurrence of:
a = b/c
d = e/f
with
icf = 1/(c*f)
a = bf*icf
d = ec*icf
if you know that icf isn't going to cause anything to go out of range and if your hardware can perform 3 multiplications faster than a division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
You'll get away with fewer time steps if you use other integration schemes (eg. Runge-Kutta) but I suspect you already know that.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js