How good is OpenCV GPU library for matrix operations?


I'm using OpenCV for an application in computer vision. I'd like to accelerate some matrix operations (matrices are fairly large) on GPU and want to avoid coding directly in CUDA C, if possible. OpenCV 2.4.1 has a number of GPU accelerated functions. How well do they perform in your experience? Am I better off using another library (e.g. Thrust) instead?
EDIT
Sample application: Calculate squared Euclidean distance matrix on GPU. Currently, my GPU accelerated (and vectorized) implementation in Matlab using the Parallel Computing Toolbox (PCT) is about 5-10 times faster than my C++ implementation with OpenCV.
Matlab implementation:
function K = sqEuclideanDist(P_cpu,Q_cpu)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
P_gpu = gpuArray(P_cpu);
Q_gpu = gpuArray(Q_cpu);
[nP, d] = size(P_gpu);
[nQ, d] = size(Q_gpu);
pmag = sum(P_gpu .* P_gpu, 2);
qmag = sum(Q_gpu .* Q_gpu, 2);
% note that K is on GPU
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu';
end
UPDATE Here's another Matlab implementation that accomplishes the same (thanks to https://stackoverflow.com/a/7774323/1121420). But it runs only on CPU because bsxfun is not supported by PCT. Still looking for C++ alternative though.
function K = sqEuclideanDist(P_cpu,Q_cpu)
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
% Runs on CPU only.
K = bsxfun(@plus,sum(P_cpu.^2,2),sum(Q_cpu.^2,2)') - 2*(P_cpu*Q_cpu');
end

I find ArrayFire to be much faster and have started using it instead of the GPU kernels in OpenCV for image processing. Here are some benchmarks I found comparing ArrayFire (used to be in a different interface called LibJacket) to OpenCV and it's been true in my benchmarking too that ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA didn't write the GPU kernels in OpenCV but contracted those out to someone, which may be why they are so slow. Since I'm only using 1 GPU, I can use ArrayFire for free.
Update, given the new MATLAB code posted by @Alex: I ran the benchmark of this code on my system. I get that the Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire kick butt. HW specs are:
Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
NVIDIA Tesla M2090
Results of CPU vs GPU using Parallel Computing Toolbox gpuArray (fully warmed up). CPU is faster than PCT gpuArray:
>> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
Elapsed time is 0.006859 seconds.
>> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
Elapsed time is 0.005712 seconds.
Results of CPU vs GPU using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X
>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001876 seconds.
Here is the modified code that lets you run all that easily:
function K = sqEuclideanDist(P,Q)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
[nP, d] = size(P);
[nQ, d] = size(Q);
pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';
end
Jacket does support BSXFUN on the GPU, and it does improve the speeds somewhat:
>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001420 seconds.
Note that the sizes used here are pretty small, so most CUDA code that attempts to run on these small sizes is likely to perform poorly. That's why I like to use AccelerEyes' stuff, because those guys have optimized the heck out of the GPU, unlike PCT gpuArray, Thrust, OpenCV, each of which I've tried in the past.
Here are the ArrayFire Free C++ results:
Time: 0.0003577 seconds
Speedups: 19.2X faster than PCT gpuArray, 16X faster than the CPU, 5.2X faster
than Jacket in MATLAB original version, 4X faster than Jacket in MATLAB using
BSXFUN
Here is the ArrayFire code I wrote for this:
#include <arrayfire.h>
#include <cstdio>
using namespace af;

static array SqEuclideanDist(array P, array Q)
{
// 0 based indexing
array pmag = sum(P * P, 1);
array qmag = sum(Q * Q, 1);
int np = P.dims(0);
int nq = Q.dims(0);
array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
return K;
}
int main(int argc, char **argv)
{
double *P_cpu = new double[1581 * 3];
double *Q_cpu = new double[189 * 3];
array P = array(1581, 3, P_cpu);
array Q = array(189 , 3, Q_cpu);
af::sync();
int iter = 1000;
timer::tic();
for (int i = 0; i < iter; i++) {
array K = SqEuclideanDist(P, Q);
af::eval(K);
}
af::sync();
printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);
delete[] P_cpu;
delete[] Q_cpu;
}

They were contributed by NVIDIA, so they do perform well on CUDA-compatible cards.
The real performance depends on the card itself and the function you are using.
In my experience only cvRotate and cvResize performed better than a normal Intel CPU.
(Note: I was only interested in image related functions)

Related

How would I execute code to solve matrices while measuring the runtime of the code?

Preferably I would use C++ to execute the code, but I would be open to any suggestion for a better language for the situation. I essentially want to use Strassen's algorithm to solve matrices, and I want to know how I would solve a matrix and measure the runtime.
# Version 3.6
import numpy as np

def split(matrix):
    """
    Splits a given matrix into quarters.
    Input: nxn matrix
    Output: tuple containing 4 n/2 x n/2 matrices corresponding to a, b, c, d
    """
    row, col = matrix.shape
    row2, col2 = row//2, col//2
    return (matrix[:row2, :col2], matrix[:row2, col2:], matrix[row2:, :col2],
            matrix[row2:, col2:])

def strassen(x, y):
    """
    Computes matrix product by divide and conquer approach, recursively.
    Input: nxn matrices x and y
    Output: nxn matrix, product of x and y
    """
    # Base case when size of matrices is 1x1
    if len(x) == 1:
        return x * y
    # Splitting the matrices into quadrants. This will be done recursively
    # until the base case is reached.
    a, b, c, d = split(x)
    e, f, g, h = split(y)
    # Computing the 7 products, recursively (p1, p2...p7)
    p1 = strassen(a, f - h)
    p2 = strassen(a + b, h)
    p3 = strassen(c + d, e)
    p4 = strassen(d, g - e)
    p5 = strassen(a + d, e + h)
    p6 = strassen(b - d, g + h)
    p7 = strassen(a - c, e + f)
    # Computing the values of the 4 quadrants of the final matrix c
    c11 = p5 + p4 - p2 + p6
    c12 = p1 + p2
    c21 = p3 + p4
    c22 = p1 + p5 - p3 - p7
    # Combining the 4 quadrants into a single matrix by stacking horizontally and vertically.
    c = np.vstack((np.hstack((c11, c12)), np.hstack((c21, c22))))
    return c
I found the code above for the algorithm.
#include <stdio.h>
#include <time.h>
int main(void) {
    clock_t tStart = clock();
    /* Do your stuff here */
    printf("Time taken: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
    return 0;
}
I found this code for measuring the runtime of the code. However I saw I could use
/usr/bin/time ./MyProgram if I have cygwin installed.
In short, how would I use my code to solve actual matrices using Strassen's algorithm and other matrix-solving algorithms? Also, how would I run the code? Thank you for the help. I am a novice to coding, and I am doing this in order to test the algorithmic efficiency of different matrix-solving algorithms in different scenarios.

Time measurement
How you measure time depends on the platform, so which OS are you on? On Windows I would use performance counters. If you have access to x86 assembly you could also use the RDTSC instruction, but using it properly takes some knowledge, such as pinning the thread to a single CPU and obtaining and stabilizing the CPU frequency.
The granularity of OS timers is an issue too, so if the measured process is too short you may need to filter multiple measurements to get correct values.
You can avoid some of these problems by measuring multiple repetitions of the process, so that the total time exceeds 100 ms, and dividing the result by the number of repetitions.
Also, reusing the same code and data while measuring means the CPU cache can skew your results.
Running the code
Your code looks like Python, so you cannot use it in C/C++ directly. Instead you need to invoke it through the Python interpreter, for example by creating a Python process with parameters telling it to open and run your source file. In that case you need to wait until the code finishes, for example by checking whether the process handle is still valid. Of course, you need to write your Python script so that the process exits when it is finished. However, I am afraid this will add a big overhead, as starting and stopping the Python process alone could be much slower than the matrix multiplication you want to measure.
Another option is to build it as a DLL or OBJ and import that into your C/C++ program (though I am not sure that is possible for Python code). Then you just call a function from your C/C++ app, so no problems there.
For some inspiration see:
Builder C++ calling VC++ class
In case the code is not too complex or does not require other libs and stuff you can try to port it to C/C++ code and use it directly.
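
The repetition approach described under "Time measurement" could look roughly like the following in portable C++11, assuming the algorithm has been ported to C++. This is a sketch only: multiplyNaive and meanSecondsPerRun are hypothetical names, the naive product merely stands in for whatever routine is being measured, and std::chrono::steady_clock is used here in place of the platform-specific counters mentioned above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Naive O(n^3) product of two n x n row-major matrices; a stand-in for the
// routine under test (for example a C++ port of the Strassen code).
std::vector<double> multiplyNaive(const std::vector<double>& a,
                                  const std::vector<double>& b,
                                  std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Calls work() `repetitions` times and returns the mean wall-clock time of one
// call in seconds, so that short runs are averaged over a longer interval.
template <typename Work>
double meanSecondsPerRun(Work work, int repetitions) {
    assert(repetitions > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / repetitions;
}
```

A call such as meanSecondsPerRun([&] { sink += multiplyNaive(a, b, n)[0]; }, 100) keeps the result in use through the sink variable, which makes it harder for the optimizer to discard the work being timed. The repetition count should be chosen so that the total run is well above the timer granularity.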

Neural Network in c++ only outputting 0.5

I'm trying to build a NN in C++.
It is being trained on the MNIST handwritten numbers data set to classify a number from a 28*28 black and white image. I have done the same problem in Python, which worked with a decent success rate, but I am trying to do it in C++ for fun. It has 784 inputs (1 for each pixel), 100 hidden and 10 output nodes, no biases and a learning rate of 0.3, the same as in the Python one.
The NN takes in a vector of 784 pixel values, normalised from 0.01 to 0.99 in grey scale, of an image of a handwritten number and outputs a 10-dimensional vector to identify the image.
However, in C++ it always outputs roughly 0.5 for all of the nodes after training. I have tested it with more hidden nodes, different learning rates and more training data, but the more training data I train it on, the closer it gets to 0.5. I have also tested that all the functions and overloads work as intended. None of these have helped.
I understand that a CNN would probably work better in this situation, but I do not fully understand how to program one.
Here is the training function:
void train(std::vector<double> inputs, std::vector<double> targets)
{
std::vector<std::vector<double>> inputs2D = {inputs};
std::vector<std::vector<double>> targets2D = {targets};
inputs2D = utils::transpose(inputs2D);
targets2D = utils::transpose(targets2D);
std::vector<std::vector<double>> hidden_in = utils::dot(ih_weight, inputs2D);
std::vector<std::vector<double>> hidden_out = utils::activation(hidden_in);
std::vector<std::vector<double>> final_in = utils::dot(ho_weight, hidden_out);
std::vector<std::vector<double>> final_out = utils::activation(final_in);
std::vector<std::vector<double>> output_error = targets2D + (-1* final_out);
std::vector<std::vector<double>> hidden_error = utils::dot(utils::transpose(ho_weight), output_error);
ho_weight = (learningRate * utils::dot((output_error * final_out * (1.0 + (-1.0 * final_out))), utils::transpose(hidden_out))) + ho_weight;
ih_weight = (learningRate * utils::dot((hidden_error * hidden_out * (1.0+(-1.0 * hidden_out))), utils::transpose(inputs2D))) + ih_weight;
}
and here is the full code and training data: here
it also includes the python file, and the c++ was compiled using c++ 11 with g++
I'm quite new to C++ so feel free to recommend any changes that would make my code better.

Arrayfire Vectorization

I'm trying to speed up the following calculations but have not been able to reach the desired speed. I'm sure the issue is with my code and not physical limitations of the GPU.
I have a matrix V that is 10,000 x 6 x 6.
And another matrix P that is 6 x 1,000.
Both are complex.
I need to do V * P (which should result in 10,000 x 6 x 1000),
take the magnitude (or mag sq) of it and then sum in the 6 dimension,
resulting in a 10,000 x 1000 matrix of real values.
I have tried the following:
af::array V{ 10000, 6, 6, c32 };
af::array P{ 6, 1000, c32 };
af::array VP = af::matmul(V, P); // results in 10,000x1000x6 - ok, as long as I still sum in the 6 dim
af::array res = af::sum(af::abs(VP),2);
This was not nearly fast enough. Then I tried converting V into an array, so I had:
af::array V[6] = { af::array{ 10000, 6, c32 },
af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 }, af::array{
10000, 6, c32 }, af::array{ 10000, 6, c32 }, af::array{
10000, 6, c32 } };
af::array VP[6];
af::array res;
for (int i = 0; i < 6; i++)
{
VP[i] = af::matmul(V[i], P);
}
res = af::abs(VP[0]);
for (int i = 1; i < 6; i++)
{
res+= af::abs(VP[i]);
}
This had about a 2x speedup. I came up with another solution, but the af::matmul overload that takes 3 arrays doesn't support options (like hermitian) and doesn't support gfor, so I couldn't try that route.
Currently, the matrix multiply (in both approaches) takes about 2.2 ms, and it looks like ArrayFire can combine the abs and sum into one JIT kernel that takes about 2 ms.
My knowledge of ArrayFire is limited, so I'm guessing there is something I'm not thinking of. Does anyone have an idea of how I can increase the speed of this algorithm?
Thank you!

I can confirm your findings that the looped version is about twice as fast as the batched matmul. The matmul on its own is not what takes most of the runtime in your code snippet; the costly part is the other operation, summing along the third dimension after abs. This is for the following reasons.
1) sum(abs(result)) - abs is again not the issue here. Sum is a reduction algorithm, and reductions are usually quite fast along the fast-moving dimension. For a reduction along a higher dimension, however, the stride between successive elements is the size of the matrix. This is expensive compared to a reduction over contiguous locations.
2) looped abs additions - This version, however, accesses elements that are contiguous in memory, because we are basically adding the respective elements of 6 matrices. On top of this, the entire loop (along with the abs op) is converted into a single JIT kernel that does the following, which is very efficient.
res = res + ptr0[i] + ptr1[i] + ptr2[i] + ptr0[i] + ptr1[i]
Above line is just for illustration, that is not the exact JIT kernel.
Hence, the looped version is faster than the batched version in this specific case, because of the reduction operation that is performed on the result of matmul.
My test GPU: GTX 1060
The matmul itself for a single [10k x 6] * [6 x 1k] takes about half a millisecond on a GTX 1060. Six such matmuls can't be done in under a millisecond on my GTX 1060, at least I would think. What is your target runtime?
EDITED (Jan 10, 2020): - Actually, This won't work because of abs operation on result of each matmul.
You can try looking into our latest entry into gemm category in master branch of ArrayFire. However, you would have to build arrayfire from source until our next feature release 3.7. You can look at the documentation at the following page.
https://github.com/arrayfire/arrayfire/blob/master/include/af/blas.h#L230
It follows the principle of Carray from cuBLAS gemm API.

How do you implement a calculated Gaussian kernel?

I am struggling with my ability to implement a calculated gaussian kernel to return a blurred image.
My current code that calculates the kernel is below:
const int m = 5;
const int n = 5;
double sigma = std;
Mat Gauss;
double kernel[m][n];
for ( int x = 0; x < m; ++x )
for ( int y = 0; y < n; ++y )
{
kernel[x][y] = (1 / (sigma * (sqrt(2 * M_PI))))
* exp(-0.5 * (std::pow((x - avg) / sigma, 2.0)
+ pow((y - avg) / sigma, 2.0) ) / (2 * M_PI * sigma * sigma));
}
However, I can't figure out how to apply this to the image in a way that I am returned a blurred image.
I would appreciate it if anyone could give me some pointers in a way that I can apply this to an image.
I was thinking of using a for loop to replace the pixels of the original image but I could not properly implement this idea.
Thank you for your time.

It sounds like you want to compute a convolution of the original image with a Gaussian kernel, something like this:
blurred[x][y] = Integral (kernel[s][t] * original[x-s][y-t]) ds dt
There are a number of techniques for that:
Direct convolution: go through the grid and compute the above integral at each point. This works well for kernels with very small support, on the order of 5 grid points in each direction, but for kernels with larger support becomes too slow. For Gaussian kernels a rule of thumb for truncating support is about 3*sigma, so it's not unreasonable to do direct convolution with sigma under 2 grid points.
Fast Fourier Transform (FFT). This works reasonably fast for any kernel, which is why FFT became the standard way to compute the convolution of nearly anything with nearly anything. Direct convolution beats FFT only for kernels with very small support.
Analytical: integrals of some kernels have analytical expressions. In particular, integral of a Gaussian is the Erf function, and, at least on Unix systems, it's available as a function call. Moreover, on some hardware (such as GPUs) Erf is implemented in hardware. In some rare (but important) cases of coarse bi-level images one can replace convolution with Gaussian with a loop of Erf function calls.
For most computational systems your best bet would be to go with FFT: it's fast and it's flexible enough to handle any kernels and images correctly.
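
For the 5 x 5 kernel in the question, the direct convolution option is small enough to write out by hand. The sketch below is a minimal illustration under stated assumptions, not the asker's code: makeGaussianKernel and blur are hypothetical names, the image is a row-major grayscale buffer of doubles rather than a cv::Mat, the kernel is normalised by dividing by its own sum instead of using an analytic prefactor, and pixels outside the image are replaced by the nearest edge pixel. With OpenCV's Mat type, cv::filter2D or cv::GaussianBlur would do the same job in one call.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds a size x size Gaussian kernel centred on the middle cell and
// normalised so that its entries sum to 1 (size must be odd).
std::vector<double> makeGaussianKernel(int size, double sigma) {
    assert(size > 0 && size % 2 == 1 && sigma > 0.0);
    const int half = size / 2;
    std::vector<double> kernel(size * size);
    double sum = 0.0;
    for (int s = -half; s <= half; ++s)
        for (int t = -half; t <= half; ++t) {
            const double w = std::exp(-(s * s + t * t) / (2.0 * sigma * sigma));
            kernel[(s + half) * size + (t + half)] = w;
            sum += w;
        }
    for (double& w : kernel) w /= sum;
    return kernel;
}

// Direct convolution: blurred[y][x] = sum over (s,t) of kernel[s][t] * image[y-s][x-t].
// Coordinates that fall outside the image are clamped to the nearest edge.
std::vector<double> blur(const std::vector<double>& image, int width, int height,
                         const std::vector<double>& kernel, int size) {
    assert(image.size() == static_cast<std::size_t>(width) * height);
    assert(kernel.size() == static_cast<std::size_t>(size) * size);
    const int half = size / 2;
    std::vector<double> out(image.size(), 0.0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            double acc = 0.0;
            for (int s = -half; s <= half; ++s)
                for (int t = -half; t <= half; ++t) {
                    const int yy = std::min(std::max(y - s, 0), height - 1);
                    const int xx = std::min(std::max(x - t, 0), width - 1);
                    acc += kernel[(s + half) * size + (t + half)] * image[yy * width + xx];
                }
            out[y * width + x] = acc;
        }
    return out;
}
```

Because the kernel sums to 1, a uniform image comes out unchanged, which is a convenient sanity check before trying real images.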

Can my loop be optimized any more?

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?
I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.
Any help is appreciated!
Edit: I'm running in Windows 7 64-bit with Visual Studio 2008 edition on a Core 2 Duo T5850 (2.16 GHz)
typedef double real;
struct Particle
{
Vector pos, vel, acc, jerk;
Vector oldPos, oldVel, oldAcc, oldJerk;
real mass;
};
class Vector
{
private:
real vec[3];
public:
// Operators defined here
};
real Gravity::interact(Particle *p, size_t numParticles)
{
PROFILE_FUNC();
real tau_q = 1e300;
for (size_t i = 0; i < numParticles; i++)
{
p[i].jerk = 0;
p[i].acc = 0;
}
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
Vector r = p[j].pos - p[i].pos;
Vector v = p[j].vel - p[i].vel;
real r2 = lengthsq(r);
real v2 = lengthsq(v);
// Calculate inverse of |r|^3
real r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
Vector da = r * r3i;
Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
real mij = p[i].mass + p[j].mass;
real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
real tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}
return sqrt(sqrt(tau_q));
}

1. Inline the calls to lengthsq().
2. Change pow(r2,-1.5) to 1/(r2*sqrt(r2)) to lower the cost of computing r2^-1.5.
3. Use scalars (p_i_acc, etc.) inside the innermost loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily.
4a. Try replacing the if (...) tau_q = with
tau_q=minimum(...,...)
Many compilers recognize the minimum function as one they can implement with predicated operations rather than real branches, avoiding pipeline flushes.
4b. [EDIT to split 4a and 4b apart] You might consider storing tau_..q2 instead as tau_q, and comparing against r2/v2 rather than (r2*r2)/(v2*v2). Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau..q2 at the end. To do this, collect the minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop.
5. [EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and a large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.
6. Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it]), and c) going multicore (using OpenMP).
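
The pow() replacement and the branch-free minimum suggested in this answer could look roughly like this. It is a sketch only: invCube and updateTau are hypothetical names, and std::min stands in for the minimum(...) mentioned above. Whether the compiler actually emits a branch-free instruction for it depends on the compiler and its flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns 1 / |r|^3 given r2 = |r|^2, using one sqrt, one multiply and one
// divide instead of a general-purpose pow(r2, -1.5) call.
inline double invCube(double r2) {
    assert(r2 > 0.0);  // coincident particles would divide by zero
    return 1.0 / (r2 * std::sqrt(r2));
}

// Folds the two collision estimates of one particle pair into the running
// minimum without an explicit if statement.
inline double updateTau(double tauQ, double est1, double est2) {
    return std::min(tauQ, std::min(est1, est2));
}
```

In the question's inner loop this would read real r3i = Constants::G * invCube(r2); followed by tau_q = updateTau(tau_q, tau_est_q1, tau_est_q2);. The results can differ from pow() in the last bits, so it is worth comparing the two on a few values first.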

This line real r3i = Constants::G * pow(r2, -1.5); is going to hurt. Any kind of sqrt lookup or platform specific help with a square root would help.
If you have simd abilities, breaking up your vector subtracts and squares into its own loop and computing them all at once will help a bit. Same for your mass/jerk calcs.
Something that comes to mind is - are you keeping enough precision with your calc? Taking things to the 4th power and 4th root really thrash your available bits through the under/overflow blender. I'd be sure that your answer is indeed your answer when complete.
Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.
Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius squared check to fight the O(n^2) behavior of your loop. If you eliminated 1/2 your particles, it would run around 4x faster.
One last thing. You could thread your inner loop across multiple processors. You'd have to make a separate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally your mass/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)

Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.
You might consider whether you can use floats rather than doubles.
If you're using gcc then make sure you're using -O2 or possibly -O3.
You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).
Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).

If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.

Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.
If you move the mass out too, you could store things like this:
struct ParticleData
{
Vector pos, vel, acc, jerk;
};
ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...
then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g
struct ParticleData
{
// data_x[0] == pos.x, data_x[1] = vel.x, data_x[2] = acc.x, data_x[3] = jerk.x
Vector4 data_x;
// data_y[0] == pos.y, data_y[1] = vel.y, etc.
Vector4 data_y;
// data_z[0] == pos.z, data_z[1] = vel.z, etc.
Vector4 data_z;
};
where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)
Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).
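
The first suggestion in this answer, moving the "old" state and the masses out of the per-particle struct, could be laid out as in the following sketch. Vec3 is a stand-in for the question's Vector class, and ParticleSystem and saveOld are hypothetical names; std::copy is used for the bulk copy, which compilers commonly lower to a memcpy for plain structs like these.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };  // stand-in for the question's Vector

// Only the fields the force loop reads and writes.
struct ParticleData {
    Vec3 pos, vel, acc, jerk;
};

struct ParticleSystem {
    std::vector<ParticleData> current;  // touched on every pair interaction
    std::vector<ParticleData> old;      // only touched by the integrator
    std::vector<double> masses;         // one mass per particle, same index

    explicit ParticleSystem(std::size_t n)
        : current(n), old(n), masses(n, 0.0) {}

    // Saves the whole current state with a single bulk copy.
    void saveOld() {
        assert(old.size() == current.size());
        std::copy(current.begin(), current.end(), old.begin());
    }
};
```

With this layout the inner loop walks only the current array and the masses array, so each cache line it fetches holds data the loop actually uses.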

My first advice would be to look at the molecular dynamics literature; people in this field have considered a lot of optimizations for particle systems. Have a look at GROMACS for example.
With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is the use of neighbour lists with buffer regions to update those lists only when needed.

All good stuff above. I've been doing similar things to a 2nd order (Leapfrog) integrator. The next two things I did after considering many of the improvements suggested above were to start using SSE intrinsics to take advantage of vectorization, and to parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.
SSE example:
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp
Novel cache algorithm, explanation and example code:
http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp
You might also find the following deck I gave at Seattle Code Camp interesting:
http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/
Your fourth-order integrator is more complex and would be harder to parallelize, with limited gains on a two-core system, but I would definitely suggest checking out SSE; I got some reasonable performance improvements there.

Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?
What's real? Can it be a float?
Apart from that you'll have to turn to MMX/SSE/assembly optimisations.

Would you benefit from the famous "fast inverse square root" algorithm?
float InvSqrt(float x)
{
union {
float f;
int i;
} tmp;
tmp.f = x;
tmp.i = 0x5f3759df - (tmp.i >> 1);
float y = tmp.f;
return y * (1.5f - 0.5f * x * y * y);
}
It returns a reasonably accurate approximation of 1/sqrt(x), so calling it with x = r2 gives 1/|r| (a clever initial guess followed by one iteration of Newton's method). It is widely used in computer graphics and game development.
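
Reading a union member other than the one last written is not well defined in standard C++, so a variant of the same trick that copies the bits with std::memcpy may be safer. The sketch below assumes a 32-bit IEEE-754 float; invSqrtApprox is a hypothetical name. It approximates 1/sqrt(x) in single precision only, so it does not match the double-precision real used in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Approximates 1 / sqrt(x) for x > 0: a bit-level initial guess followed by
// one Newton-Raphson refinement step.
float invSqrtApprox(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "needs a 32-bit float");
    assert(x > 0.0f);

    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // reinterpret the float's bit pattern
    bits = 0x5f3759dfu - (bits >> 1);      // initial guess from the magic constant
    float y;
    std::memcpy(&y, &bits, sizeof y);

    return y * (1.5f - 0.5f * x * y * y);  // one Newton iteration
}
```

For the gravity loop, 1/|r|^3 would then be y * y * y with y = invSqrtApprox(r2). The error of that approximation feeds directly into the accelerations, so it is only worth considering where approximate forces are acceptable.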

Consider also pulling your multiplication by Constants::G out of the loop. If you can change the semantic meaning of the vectors stored so that they effectively store the actual value/G, you can do the gravitational constant multiplication as needed.
Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.
Consider splitting your particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could be an efficient memset if you did this. You would then essentially have two arrays (or vectors) where particle 'n' is stored at index 'n' of each of the arrays.

Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.
Now then, always always apply algorithm optimizations first and only when no faster algorithm is available should you go piecemeal optimization by assembly. And then, do inner loops first.
You may want to profile to see if this is really the bottleneck first.

The first thing I look for is branching; branches tend to be performance killers.
You can use loop unrolling.
also, remember multiple with smaller parts of the problem :-
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
is about the same as having one loop doing everything, and you can get speed ups through loop unrolling and better hitting of the cache
You could thread this to make better use of multiple cores.
You have some expensive calculations that you might be able to reduce, especially if they end up computing the same thing more than once; caching the results can help there.
But you really need to know where it's costing you the most.

You should re-use the reals and vectors that you always use. The cost of constructing a Vector or Real might be trivial.. but not if numParticles is very large, especially with your seemingly O((n^2)/2) loop.
Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
r = p[j].pos - p[i].pos;
v = p[j].vel - p[i].vel;
r2 = lengthsq(r);
v2 = lengthsq(v);
// Calculate inverse of |r|^3
r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
da = r * r3i;
dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
mij = p[i].mass + p[j].mass;
tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}

You can replace any occurrence of:
a = b/c
d = e/f
with
icf = 1/(c*f)
a = b*f*icf
d = e*c*icf
if you know that icf isn't going to cause anything to go out of range and if your hardware can perform the five extra multiplications faster than one division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
You'll get away with fewer time steps if you use other integration schemes (eg. Runge-Kutta) but I suspect you already know that.
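
The division-combining rewrite above can be sketched as a small helper. Quotients and divideBoth are hypothetical names used only for this illustration; the results can differ from two plain divisions in the last bits because the rounding happens in a different order.

```cpp
#include <cassert>
#include <cmath>

struct Quotients {
    double a;  // b / c
    double d;  // e / f
};

// Computes b/c and e/f with a single division by sharing the reciprocal of c*f.
Quotients divideBoth(double b, double c, double e, double f) {
    assert(c != 0.0 && f != 0.0);
    const double icf = 1.0 / (c * f);  // the only division
    return Quotients{b * f * icf, e * c * icf};
}
```

If c * f can overflow or underflow for the value ranges involved, the two separate divisions are the safer choice, which is the out-of-range caveat mentioned above.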
