I want to write a benchmark for the Xeon Phi (60 cores). In my program I use the OpenMP standard and Intel intrinsics. I implemented a parallel version of the algorithm (a 5-point stencil computation) which is up to 230 times faster than the scalar algorithm. Now I want to add SIMD to the parallel code, but I have a problem with performance. When I call _mm512_store_pd(), the performance of the computation drops and the parallel version with SIMD is slower than the version without SIMD. What is the problem? What should I do to get better performance?
for (int i = start; i < stop; i += threadsPerCore)
{
    for (int j = 8; j < n + 8; j += 8)
    {
        // Load the centre element and its four neighbours (up, down, left, right).
        __m512d v_c = _mm512_load_pd(&matrixIn[i * n_real + j]);
        __m512d v_g = _mm512_load_pd(&matrixIn[(i - 1) * n_real + j]);
        __m512d v_d = _mm512_load_pd(&matrixIn[(i + 1) * n_real + j]);
        __m512d v_l = _mm512_loadu_pd(&matrixIn[i * n_real + (j - 1)]);
        __m512d v_p = _mm512_loadu_pd(&matrixIn[i * n_real + (j + 1)]);

        // 5-point maximum.
        __m512d v_max = _mm512_max_pd(v_c, v_g);
        v_max = _mm512_max_pd(v_max, v_d);
        v_max = _mm512_max_pd(v_max, v_l);
        v_max = _mm512_max_pd(v_max, v_p);

        _mm512_store_pd(&matrixOut[i * n_real + j], v_max);
    }
}
I start the computation from 8 because one vector at the beginning and one vector at the end are halo elements. n_real is the row size: n + 16. start and stop are computed because I partition the matrix across 60 cores, and one part (m/60) is computed by 4 hardware threads.
Someone (maybe you) seems to have asked an identical question (at least, the code sample quoted is the same as yours) on the Intel Developer Zone at https://software.intel.com/en-us/forums/topic/531721, where there are answers (including a rewrite that got a 40% performance improvement).
Perhaps reading that would be useful?
(If it was you, I see no objection to asking in both places, but it would be polite to tell people here that you have already asked there, so that they don't waste time reproducing answers that people have already given in the other forum).
Related
I have the "honor" to improve the runtime of the following code of someone else. (it's a non-maximum supression from the canny - algorithm"). My first thought was to use SSE-intrinsic code, i'm very new in this area, so my question is.
Is there any chance to do this?
And if so, can someone give me a few hints?
void vNonMaximumSupression(
    float* fpDst,
    float const* const fpMagnitude,
    unsigned char const* const ucpGradient, ///< [in] 0 -> 0°, 1 -> 45°, 2 -> 90°, 3 -> 135°
    int iXCount,
    int iXOffset,
    int iYCount,
    int ignoreX,
    int ignoreY)
{
    memset(fpDst, 0, sizeof(fpDst[0]) * iXCount * iXOffset);
    for (int y = ignoreY; y < iYCount - ignoreY; ++y)
    {
        for (int x = ignoreX; x < iXCount - ignoreX; ++x)
        {
            int idx = iXOffset * y + x;
            unsigned char dir = ucpGradient[idx];
            float fMag = fpMagnitude[idx];
            if (dir == 0 && fpMagnitude[idx - 1] < fMag && fMag > fpMagnitude[idx + 1] ||
                dir == 1 && fpMagnitude[idx - iXCount + 1] < fMag && fMag > fpMagnitude[idx + iXCount - 1] ||
                dir == 2 && fpMagnitude[idx - iXCount] < fMag && fMag > fpMagnitude[idx + iXCount] ||
                dir == 3 && fpMagnitude[idx - iXCount - 1] < fMag && fMag > fpMagnitude[idx + iXCount + 1])
                fpDst[idx] = fMag;
            else
                fpDst[idx] = 0;
        }
    }
}
Discussion
As #harold noted, the main problem for vectorization here is that the algorithm uses a different offset for each pixel (specified by the direction matrix). I can think of several potential ways of vectorizing it:
#nikie: evaluate all branches at once, i.e. compare each pixel with all of its neighbors. The results of these comparisons are blended according to the direction values.
#PeterCordes: load a lot of pixels into SSE registers, then use _mm_shuffle_epi8 to choose only the neighbors in the given direction. Then perform two vectorized comparisons.
(me): use scalar instructions to load the proper two neighboring pixels along the direction. Then combine these values for four pixels into SSE registers. Finally, do two comparisons in SSE.
The second approach is very hard to implement efficiently, because for a pack of 4 pixels there are 18 neighboring pixels to choose from. That would require too many shuffles, I think.
The first approach looks nice, but it would perform four times more operations per pixel. I suspect the speedup from vector instructions would be cancelled out by the extra computation.
I suggest using the third approach. Below you can see hints on improving performance.
Hybrid approach
First of all, we want to make the scalar code as fast as possible. The code you presented contains too many branches. Most of them are not predictable, for instance the switch on direction.
In order to remove branches, I suggest creating an array delta = {1, stride - 1, stride, stride + 1}, which gives the index offset for each direction. Using this array, you can find the indices of the neighboring pixels to compare with (without branches). Then you do two comparisons. Finally, you can write a ternary operator like res = (isMax ? curr : 0);, hoping that the compiler can generate branchless code for it.
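A minimal sketch of that branchless inner loop, reusing the names from the question (here stride stands in for the row pitch used to index fpMagnitude, which is an assumption on my part):

for (int x = ignoreX; x < iXCount - ignoreX; ++x)
{
    const int delta[4] = { 1, stride - 1, stride, stride + 1 }; // index offset per direction
    int idx = stride * y + x;
    int offset = delta[ucpGradient[idx]];     // pick the neighbor offset without a switch
    float fMag = fpMagnitude[idx];
    int isMax = fpMagnitude[idx - offset] < fMag && fMag > fpMagnitude[idx + offset];
    fpDst[idx] = isMax ? fMag : 0.0f;         // hopefully compiled to a branchless select
}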
Unfortunately, the compiler (at least MSVC 2013) is not smart enough to avoid the branch on isMax. That's why we can benefit from rewriting the inner loop with scalar SSE intrinsics. Look up the intrinsics guide for reference. You mostly need intrinsics ending with _ss, since the code is completely scalar.
Finally, we can vectorize everything except for loading the neighboring pixels. In order to load neighboring pixels, we can use the _mm_setr_ps intrinsic with scalar arguments, asking the compiler to generate some good code for us =)
__m128 forw = _mm_setr_ps(src[idx+0 + offset0], src[idx+1 + offset1], src[idx+2 + offset2], src[idx+3 + offset3]);
__m128 back = _mm_setr_ps(src[idx+0 - offset0], src[idx+1 - offset1], src[idx+2 - offset2], src[idx+3 - offset3]);
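For completeness, here is one possible way the rest of the hybrid inner loop could look (my own sketch, not necessarily the author's exact code; src and dst stand in for fpMagnitude and fpDst): two vectorized comparisons followed by a masked store.

__m128 curr  = _mm_loadu_ps(&src[idx]);                // four current pixels
__m128 isMax = _mm_and_ps(_mm_cmplt_ps(back, curr),    // back < curr
                          _mm_cmplt_ps(forw, curr));   // forw < curr
_mm_storeu_ps(&dst[idx], _mm_and_ps(isMax, curr));     // keep the pixel where it is a maximum, else 0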
I have just implemented it myself. Tested in a single thread on an Ivy Bridge at 3.4 GHz. A random image of 1024 x 1024 resolution was used as the source. The results (in milliseconds) are:
original: 13.078 //your code
branchless: 8.556 //'branchless' code
scalarsse: 2.151 //after rewriting to sse intrinsics
hybrid: 1.159 //partially vectorized code
They confirm performance improvements on each step. The final code needs a bit more than one millisecond to process a one-megapixel image. The total speedup is about 11.3 times. Indeed, you can get better performance on GPU =)
I hope the presented information would be enough for you to reproduce the steps. If you are seeking terrible spoilers, look here for my implementations of all these stages.
I am frustrated with how much time the fitch software takes to do some simple computations. I profiled it with Intel VTune and it seems that 52% of the CPU time is spent in the nudists() function:
void nudists(node *x, node *y)
{
    /* compute distance between an interior node and tips */
    long nq = 0, nr = 0, nx = 0, ny = 0;
    double dil = 0, djl = 0, wil = 0, wjl = 0, vi = 0, vj = 0;
    node *qprime, *rprime;

    qprime = x->next;
    rprime = qprime->next->back;
    qprime = qprime->back;
    ny = y->index;
    dil = qprime->d[ny - 1];
    djl = rprime->d[ny - 1];
    wil = qprime->w[ny - 1];
    wjl = rprime->w[ny - 1];
    vi = qprime->v;
    vj = rprime->v;
    x->w[ny - 1] = wil + wjl;
    if (wil + wjl <= 0.0)
        x->d[ny - 1] = 0.0;
    else
        x->d[ny - 1] = ((dil - vi) * wil + (djl - vj) * wjl) / (wil + wjl);
    nx = x->index;
    nq = qprime->index;
    nr = rprime->index;
    dil = y->d[nq - 1];
    djl = y->d[nr - 1];
    wil = y->w[nq - 1];
    wjl = y->w[nr - 1];
    y->w[nx - 1] = wil + wjl;
    if (wil + wjl <= 0.0)
        y->d[nx - 1] = 0.0;
    else
        y->d[nx - 1] = ((dil - vi) * wil + (djl - vj) * wjl) / (wil + wjl);
} /* nudists */
The two long lines are responsible for 24% of the total CPU time. Is there any way to optimize this code and especially the two long lines? Another function which consumes a lot of CPU time is this:
void secondtraverse(node *q, double y, long *nx, double *sum)
{
    /* from each of those places go back to all others */
    /* nx comes from firsttraverse */
    /* sum comes from evaluate via firsttraverse */
    double z = 0.0, TEMP = 0.0;

    z = y + q->v;
    if (q->tip) {
        TEMP = q->d[(*nx) - 1] - z;
        *sum += q->w[(*nx) - 1] * (TEMP * TEMP);
    } else {
        secondtraverse(q->next->back, z, nx, sum);
        secondtraverse(q->next->next->back, z, nx, sum);
    }
} /* secondtraverse */
The code which calculates the sum is responsible for 18% of the CPU time. Any way to make it run faster?
The complete source code can be found here: http://evolution.genetics.washington.edu/phylip/getme.html
As far as optimizing the big equation lines goes, you are using some of the most time-consuming operations: multiplication and division.
You will have to look for optimizations in a bigger frame, picture or scope. Some ideas:
Fixed Point arithmetic
Eliminating the division for each iteration.
Threading
Multiple cores
Array, not linked list
Fixed Point Arithmetic
If you can make your numeric base a power of 2, many of your divisions will change into bit shifts. For example, dividing by 16 is the same as right shifting 4 times. Shifts are usually faster than divisions.
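A tiny illustration of the idea (my own sketch, not code from fitch), using a Q16.16 fixed-point format:

#include <cstdint>

typedef int32_t fix16;                                               // Q16.16 fixed point
inline fix16  to_fix(double x)          { return (fix16)(x * 65536.0); }
inline double to_double(fix16 x)        { return x / 65536.0; }
inline fix16  fix_mul(fix16 a, fix16 b) { return (fix16)(((int64_t)a * b) >> 16); }
inline fix16  fix_div16(fix16 a)        { return a >> 4; }           // dividing by 16 is a 4-bit right shift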
Eliminating division per iteration
Rather than performing the division on each iteration, extract it out and perform it less often, perhaps using different values.
If you treat the division as a fraction, you can play with the numerator many times before dividing by the denominator.
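As a generic sketch of the idea (not tied to the fitch data structures; d, v, w, and n are assumed names), you can accumulate the numerator and denominator separately and divide once at the end instead of on every iteration:

double num = 0.0, den = 0.0;
for (long i = 0; i < n; i++) {
    num += (d[i] - v[i]) * w[i];   // build up the numerator
    den += w[i];                   // and the denominator
}
double result = (den > 0.0) ? num / den : 0.0;   // a single division at the end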
Threading
You may want to consider multiple threads. Create threads based on code efficiency. Let one thread be a worker thread that calculates in the background.
Multiple Cores (parallel execution)
The 'x' and 'y' variables appear to be independent of each other. These calculations could be set up for parallel programming. One core or processor performs the 'x' calculation while another core is calculating the 'y' variable.
Think about splitting this at a higher level. One core (thread) processes all the 'x' variables while another core processes the 'y' variables. The results are saved independently. Let the main core process all the results after all the 'x' and 'y' variables have been calculated.
Arrays, not lists
Your processor will be happiest when all of its data fits into its data cache. If it can't fit all the data, then fit as much as possible. Thus an array has a better chance of fitting into data cache lines than a linked list. The processor knows that array addresses are sequential and may not have to reload the data cache.
I have the following code that I use to compute the distance between two vectors:
double dist(vector<double>& vecA, vector<double>& vecB) {
    double curDist = 0.0;
    for (size_t i = 0; i < vecA.size(); i++) {
        double dif = vecA[i] - vecB[i];
        curDist += dif * dif;
    }
    return curDist;
}
This function is a major bottleneck in my application since it relies on a lot of distance calculations, consuming more than 60% of CPU time on a typical input. Additionally, the following line:
double dif = vecA[i] - vecB[i];
is responsible for more than 77% of CPU time in this function. My question is: is it possible to somehow optimize this function?
Notes:
To profile my application I have used Intel Amplifier XE;
Reducing the number of distance computations is not a feasible solution for me;
There are two possible issues I can think of right now:
This computation is memory bound.
There is an iteration-to-iteration dependency on curDist.
This computation is memory bound.
Your dataset is larger than your CPU cache. So in this case, no amount of optimization is going to help unless you can restructure your algorithm.
There is an iteration-to-iteration dependency on curDist.
You have a dependency on curDist. This will block vectorization by the compiler. (Also, don't always trust the profiler numbers to the line. They can be inaccurate especially after compiler optimizations.)
Normally, the compiler's vectorizer can split curDist into multiple partial sums and unroll/vectorize the loop. But it can't do that under strict floating-point behavior. You can try relaxing your floating-point mode if you haven't already. Or you can split the sum and unroll it yourself.
For example, this kind of optimization is something the compiler can do with integers, but not necessarily with floating-point:
double curDist0 = 0.0;
double curDist1 = 0.0;
double curDist2 = 0.0;
double curDist3 = 0.0;
size_t size = vecA.size();
for (size_t i = 0; i + 4 <= size; i += 4) {   // i + 4 <= size avoids size_t underflow for tiny vectors
    double dif0 = vecA[i + 0] - vecB[i + 0];
    double dif1 = vecA[i + 1] - vecB[i + 1];
    double dif2 = vecA[i + 2] - vecB[i + 2];
    double dif3 = vecA[i + 3] - vecB[i + 3];
    curDist0 += dif0 * dif0;
    curDist1 += dif1 * dif1;
    curDist2 += dif2 * dif2;
    curDist3 += dif3 * dif3;
}
double curDist = curDist0 + curDist1 + curDist2 + curDist3;
// Cleanup in case (vecA.size() % 4 != 0): handle the remaining 0-3 elements
for (size_t i = size & ~size_t(3); i < size; i++) {
    double dif = vecA[i] - vecB[i];
    curDist += dif * dif;
}
You could eliminate the call to vecA.size() on each iteration of the loop by calling it once before the loop. You could also do loop unrolling to give yourself more computation per loop iteration. What compiler are you using, and what optimization settings? The compiler will often do unrolling for you, but you could do it manually.
If it's feasible (if the range of the numbers isn't huge) you may want to explore using fixed point to store these numbers, rather than doubles.
Fixed point would turn these into int operations rather than double operations.
Another interesting thing is that, assuming your profile is correct, the lookups seem to be a significant factor (otherwise the multiplication would likely be more costly than the subtraction).
I'd try using a const vector iterator rather than the random access lookup. It may help in two ways: 1 - it is constant, and 2 - the serial nature of the iterator may let the processor do better caching.
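A minimal sketch of what that could look like (same semantics as the original dist()):

#include <vector>
using std::vector;

double dist(const vector<double>& vecA, const vector<double>& vecB)
{
    double curDist = 0.0;
    vector<double>::const_iterator itA = vecA.begin();
    vector<double>::const_iterator itB = vecB.begin();
    for (; itA != vecA.end(); ++itA, ++itB) {
        double dif = *itA - *itB;     // sequential access through const iterators
        curDist += dif * dif;
    }
    return curDist;
}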
If your platform does not have (or is not using) an ALU that supports floating point math, floating point libraries, by nature, are slow and consume additional non-volatile memory. I suggest instead using 32-bit (long) or 64-bit (long long) fixed-point arithmetic. Then convert the final result to floating point at the end of the algorithm. I did this on a project a couple years ago to improve the performance of an I2T algorithm and it worked wonderfully.
I'm trying to optimize some C++ (RK4) by using
__builtin_prefetch
I can't figure out how to prefetch a whole structure.
I don't understand how much of the const void *addr is read. I want to have the next values of from and to loaded.
for (int i = from; i < to; i++)
{
    double kv = myLinks[i].kv;
    particle* from = con[i].Pfrom;
    particle* to = con[i].Pto;
    // Prefetch values at con[i++].Pfrom & con[i].Pto;
    double pos = to->px - from->px;
    double delta = from->r + to->r - pos;
    double k1 = axcel(kv, delta, from->mass) * dt;            // axcel is an inlined function
    double k2 = axcel(kv, delta + 0.5 * k1, from->mass) * dt;
    double k3 = axcel(kv, delta + 0.5 * k2, from->mass) * dt;
    double k4 = axcel(kv, delta + k3, from->mass) * dt;

    #define likely(x) __builtin_expect((x), 1)
    if (likely(!from->bc))
    {
        from->x += ((k1 + 2 * k2 + 2 * k3 + k4) / 6);
    }
}
Link: http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/
I think it just emits one PREFETCH machine instruction, which basically fetches a cache line, whose size is processor specific.
And you could use __builtin_prefetch(con[i+3].Pfrom), for instance. In my (limited) experience, in such a loop it is better to prefetch several elements in advance.
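For example, something along these lines could be tried (my sketch; the distance of 3 iterations is just a starting point to tune, and the bounds check keeps the array access in range):

for (int i = from; i < to; i++)
{
    if (i + 3 < to) {
        __builtin_prefetch(con[i + 3].Pfrom);   // hint: fetch the particles a few iterations ahead
        __builtin_prefetch(con[i + 3].Pto);
    }
    // ... rest of the loop body unchanged ...
}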
Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside a loop). Measure the performance gain if you need them, and use GCC optimization (at least -O2). If you are very lucky, manual __builtin_prefetch could increase the performance of your loop by 10 or 20% (but it could also hurt it).
If such a loop is crucial to you, you might consider running it on GPUs with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA language, and tuning them to your particular hardware).
Use also a recent GCC compiler (the latest release is 4.6.2) because it is making a lot of progress in these areas.
(Added in January 2018:)
Both hardware (processors) and compilers have made a lot of progress regarding caches, so it seems that using __builtin_prefetch is less useful today (in 2018). Be sure to benchmark.
It reads a cache line. Cache line size may vary, but it is most likely to be 64 bytes on modern CPUs. If you need to read multiple cache lines, check out prefetch_range.
Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?
I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.
Any help is appreciated!
Edit: I'm running in Windows 7 64-bit with Visual Studio 2008 edition on a Core 2 Duo T5850 (2.16 GHz)
typedef double real;

struct Particle
{
    Vector pos, vel, acc, jerk;
    Vector oldPos, oldVel, oldAcc, oldJerk;
    real mass;
};

class Vector
{
private:
    real vec[3];

public:
    // Operators defined here
};

real Gravity::interact(Particle *p, size_t numParticles)
{
    PROFILE_FUNC();
    real tau_q = 1e300;

    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].jerk = 0;
        p[i].acc = 0;
    }

    for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            Vector r = p[j].pos - p[i].pos;
            Vector v = p[j].vel - p[i].vel;
            real r2 = lengthsq(r);
            real v2 = lengthsq(v);

            // Calculate inverse of |r|^3
            real r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
            Vector da = r * r3i;
            Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            real mij = p[i].mass + p[j].mass;
            real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            real tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }

    return sqrt(sqrt(tau_q));
}
1. Inline the calls to lengthsq().
2. Change pow(r2, -1.5) to 1/(r2*sqrt(r2)) to lower the cost of computing r^1.5 (see the sketch at the end of this answer).
3. Use scalars (p_i_acc, etc.) inside the innermost loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily.
4a. Try replacing the if (...) tau_q = with
tau_q = minimum(..., ...)
Many compilers recognize the minimum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes (also sketched at the end of this answer).
4b. [EDIT to split 4a and 4b apart] You might consider storing tau_est_q2 instead as tau_q, and comparing against r2/v2 rather than r2*r2/v2*v2. Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau_est_q2 at the end. To do this, collect minimums of tau_est_q1 and tau_est_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop.
5. [EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and a large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.
Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand-coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it]), and c) going multicore (using OpenMP).
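A minimal sketch of points 2 and 4a above (G, r2, tau_q, and tau_est_q1 stand in for the corresponding names in the question's loop; std::min plays the role of the minimum function mentioned above):

#include <algorithm>
#include <cmath>

inline double inv_r3(double G, double r2)
{
    // Same value as G * pow(r2, -1.5), but with one sqrt and one division.
    return G / (r2 * std::sqrt(r2));
}

inline void update_tau(double &tau_q, double tau_est_q1)
{
    // Many compilers turn std::min into a predicated (branchless) minimum.
    tau_q = std::min(tau_q, tau_est_q1);
}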
This line real r3i = Constants::G * pow(r2, -1.5); is going to hurt. Any kind of sqrt lookup or platform specific help with a square root would help.
If you have SIMD capabilities, breaking your vector subtracts and squares out into their own loop and computing them all at once will help a bit. Same for your mass/jerk calcs.
Something that comes to mind is - are you keeping enough precision with your calc? Taking things to the 4th power and 4th root really thrash your available bits through the under/overflow blender. I'd be sure that your answer is indeed your answer when complete.
Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.
Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius-squared check to fight the O(n^2) behavior of your loop. If you eliminated half your particles, it would run around 4x faster.
One last thing. You could thread your inner loop across multiple processors. You'd have to make a separate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally your mass/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)
Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.
You might consider whether you can use floats rather than doubles.
If you're using gcc then make sure you're using -O2 or possibly -O3.
You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).
Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).
If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.
Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.
If you move the mass out too, you could store things like this:
struct ParticleData
{
Vector pos, vel, acc, jerk;
};
ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...
then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
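A small sketch of that bulk update (numParticles is an assumed name for the particle count; ParticleData must stay trivially copyable for this to be valid):

#include <cstring>

void saveOldState(ParticleData* oldParticles,
                  const ParticleData* currentParticles,
                  std::size_t numParticles)
{
    // One bulk copy replaces the per-field "old = current" assignments.
    std::memcpy(oldParticles, currentParticles, numParticles * sizeof(ParticleData));
}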
If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g.
struct ParticleData
{
    // data_x[0] == pos.x, data_x[1] == vel.x, data_x[2] == acc.x, data_x[3] == jerk.x
    Vector4 data_x;
    // data_y[0] == pos.y, data_y[1] == vel.y, etc.
    Vector4 data_y;
    // data_z[0] == pos.z, data_z[1] == vel.z, etc.
    Vector4 data_z;
};
where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)
Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).
My first advice would be to look at the molecular dynamics literature; people in this field have considered a lot of optimizations for particle systems. Have a look at GROMACS for example.
With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles, but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is to use neighbour lists with buffer regions, updating those lists only when needed.
All good stuff above. I've been doing similar things to a 2nd order (Leapfrog) integrator. The next two things I did after considering many of the improvements suggested above was start using SSE intrinsics to take advantage of vectorization and parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.
SSE example:
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp
Novel cache algorithm, explanation and example code:
http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp
You might also find the following deck I gave at Seattle Code Camp interesting:
http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/
Your fourth-order integrator is more complex and would be harder to parallelize with limited gains on a two-core system, but I would definitely suggest checking out SSE; I got some reasonable performance improvements here.
Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?
What's real? Can it be a float?
Apart from that you'll have to turn to MMX/SSE/assembly optimisations.
Would you benefit from the famous "fast inverse square root" algorithm?
float InvSqrt(float x)
{
    union {
        float f;
        int i;
    } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y);
}
It returns a reasonably accurate approximation of 1/sqrt(x) (the first iteration of Newton's method with a clever initial guess). It is used widely in computer graphics and game development.
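One way it could slot into the loop above (my sketch; accuracy drops to roughly single precision, so verify the results): since 1/|r|^3 equals InvSqrt(r2) cubed,

float inv = InvSqrt((float)r2);                       // approx. 1/sqrt(r2) == 1/|r|
real  r3i = Constants::G * (real)(inv * inv * inv);   // approx. G * pow(r2, -1.5)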
Consider also pulling your multiplication by Constants::G out of the loop. If you can change the semantic meaning of the vectors stored so that they effectively store the actual value/G, you can do the gravitational constant multiplication as needed.
Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.
Consider splitting your particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could be an efficient memset if you did this. You would then essentially have two arrays (or vectors) where particle 'n' is stored at index 'n' of each of the arrays.
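A rough sketch of that split (hypothetical struct names; Vector and real are the types from the question). Resetting the accumulators then becomes a single memset, which is valid here because an all-zero bit pattern is 0.0 for IEEE doubles:

#include <cstring>

struct ParticleState  { Vector pos, vel, oldPos, oldVel; real mass; };
struct ParticleDerivs { Vector acc, jerk; };

void resetDerivs(ParticleDerivs* derivs, size_t numParticles)
{
    // Replaces the per-particle zeroing loop at the top of interact().
    std::memset(derivs, 0, numParticles * sizeof(ParticleDerivs));
}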
Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.
Now then, always always apply algorithm optimizations first and only when no faster algorithm is available should you go piecemeal optimization by assembly. And then, do inner loops first.
You may want to profile to see if this is really the bottleneck first.
The thing I look for is branching; branches tend to be performance killers.
You can use loop unrolling.
Also, remember that multiple loops working on smaller parts of the problem:
for (size_t i = 0; i < numParticles; i++)
{
    for (size_t j = i+1; j < numParticles; j++)
    {
are about the same as having one loop doing everything, and you can get speed-ups through loop unrolling and better cache hits.
You could thread this to make better use of multiple cores.
You have some expensive calculations that you might be able to reduce, especially if the calcs end up computing the same thing; you could use caching, etc.
But you really need to know where it's costing you the most.
You should re-use the reals and Vectors that you always use. The cost of constructing a Vector or real might be trivial, but not if numParticles is very large, especially with your seemingly O((n^2)/2) loop.
Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;

for (size_t i = 0; i < numParticles; i++)
{
    for (size_t j = i+1; j < numParticles; j++)
    {
        r = p[j].pos - p[i].pos;
        v = p[j].vel - p[i].vel;
        r2 = lengthsq(r);
        v2 = lengthsq(v);

        // Calculate inverse of |r|^3
        r3i = Constants::G * pow(r2, -1.5);

        // da = r / |r|^3
        // dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
        da = r * r3i;
        dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

        // Calculate new acceleration and jerk
        p[i].acc += da * p[j].mass;
        p[i].jerk += dj * p[j].mass;
        p[j].acc -= da * p[i].mass;
        p[j].jerk -= dj * p[i].mass;

        // Collision estimation
        // Metric 1) tau = |r|^2 / |a(j) - a(i)|
        // Metric 2) tau = |r|^4 / |v|^4
        mij = p[i].mass + p[j].mass;
        tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
        tau_est_q2 = (r2*r2) / (v2*v2);

        if (tau_est_q1 < tau_q)
            tau_q = tau_est_q1;
        if (tau_est_q2 < tau_q)
            tau_q = tau_est_q2;
    }
}
You can replace any occurrence of:
a = b / c
d = e / f
with
icf = 1 / (c * f)
a = b * f * icf
d = e * c * icf
if you know that icf isn't going to cause anything to go out of range and if your hardware can perform 3 multiplications faster than a division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
You'll get away with fewer time steps if you use other integration schemes (e.g. Runge-Kutta), but I suspect you already know that.