__builtin_prefetch, How much does it read? - c++

I'm trying to optimize some C++ (RK4) by using
__builtin_prefetch
I can't figure out how to prefetch a whole structure.
I don't understand how much of the const void *addr is read. I want to have the next values of from and to loaded.
for (int i = from; i < to; i++)
{
double kv = myLinks[i].kv;
particle* from = con[i].Pfrom;
particle* to = con[i].Pto;
//Prefetch values at con[i++].Pfrom & con[i].Pto;
double pos = to->px- from->px;
double delta = from->r + to->r - pos;
double k1 = axcel(kv, delta, from->mass) * dt; //axcel is an inlined function
double k2 = axcel(kv, delta + 0.5 * k1, from->mass) * dt;
double k3 = axcel(kv, delta + 0.5 * k2, from->mass) * dt;
double k4 = axcel(kv, delta + k3, from->mass) * dt;
#define likely(x) __builtin_expect((x),1)
if (likely(!from->bc))
{
from->x += (( k1 + 2 * k2 + 2 * k3 + k4) / 6);
}
}
Link: http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/

I think it just emit one FETCH machine instruction, which basically fetches a line cache, whose size is processor specific.
And you could use __builtin_prefetch (con[i+3].Pfrom) for instance. By my (small) experience, in such a loop, it is better to prefetch several elements in advance.
Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside a loop). Measure the performance gain if you need them, and use GCC optimization (at least -O2). If you are very lucky, manual __builtin_prefetch could increase the performance of your loop by 10 or 20% (but it could also hurt it).
If such a loop is crucial to you, you might consider running it on GPUs with OpenCL or CUDA (but that requires recoding some routines in OpenCL or CUDA language, and tuning them to your particular hardware).
Use also a recent GCC compiler (the latest release is 4.6.2) because it is making a lot of progress on these areas.
(added in january 2018:)
Both hardware (processors) and compilers have made a lot of progress regarding caches, so it seems that using __builtin_prefetch is less useful today (in 2018). Be sure to benchmarck.

It reads a cache line. Cache line size may vary, but it is most likely to be 64 bytes on modern CPUs. If you need to read multiple cache lines, check out prefetch_range.

Related

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Points is a class I have implemented) and computes the variance of them, in each dimension (with regard to the algorithm of Knuth).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would really happy to pay the trade-off between speed and approximate computation of variances. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Notice, that before edit, I had used -o3, not with a capital 'o'. Thanks to ypnos, I compiled now with the optimization flag -O3. I am sure that there was a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
copute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: point[i] is just the i-th element of std::vector, where the second [], is implemented as this, in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in the compute_variances() is for sure something heavy! However, Knuth's algorithm was a numerical stable one and I was not able to find another algorithm, that would de both numerical stable and without division.
Note that I am not interesting in parallelism right now.
[EDIT.2]
Minimal example of Point class (I think I didn't forget to show something):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* #param i - index of the coordinate
* #return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* #param i - index of the coordinate
* #return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};
1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped-up by doing the computation for four elements at once with SSE. As your code really only processes one single element in each loop iteration, there is no bottleneck. If you go down to 16bit short instead of 32bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the solution to your performance problem, but just one of them that can also be combined with others.
2. Micro-parallelizm
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The option to the latter depends on your CPU, but nowadays typically -march core2 is fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen impact of cache misses, run the inner loop alternatively forwards then backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE754 representation of 0.0 in 32bits is 0x00000000. If the dimension is big, it worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense, to try to reduce the number of divisions: use partial non-cumulative average calculation, that would result in Dim division operation for N elements (instead of N x Dim); N < points.size()
Huge speedup could be achieved, using Cuda or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).
Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place the avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors like datums that are very close to each other. Less chance for data cache miss and better chance to load all of the data into the cache.
You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to the extraction of the parts, performing the math, then reassemblying the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is convert to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to used Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

C++ (and maths) : fast approximation of a trigonometric function

I know this is a recurring question, but I haven't really found a useful answer yet. I'm basically looking for a fast approximation of the function acos in C++, I'd like to know if I can significantly beat the standard one.
But some of you might have insights on my specific problem: I'm writing a scientific program which I need to be very fast. The complexity of the main algorithm boils down to computing the following expression (many times with different parameters):
sin( acos(t_1) + acos(t_2) + ... + acos(t_n) )
where the t_i are known real (double) numbers, and n is very small (like smaller than 6). I need a precision of at least 1e-10. I'm currently using the standard sin and acos C++ functions.
Do you think I can significantly gain speed somehow? For those of you who know some maths, do you think it would be smart to expand that sine in order to get an algebraic expression in terms of the t_i (only involving square roots)?
Thank you your your answers.
The code below provides simple implementations of sin() and acos() that should satisfy your accuracy requirements and that you might want to try. Please note that the math library implementation on your platform is very likely highly tuned for the specific hardware capabilities of that platform and is probably also coded in assembly for maximum efficiency, so simple compiled C code not catering to specifics of the hardware is unlikely to provide higher performance, even when the accuracy requirements are somewhat relaxed from full double precision. As Viktor Latypov points out, it may also be worthwhile to search for algorithmic alternatives that do not require expensive calls to transcendental math functions.
In the code below I have tried to stick to simple, portable constructs. If your compiler supports the rint() function [specified by C99 and C++11] you might want to use that instead of my_rint(). On some platforms, the call to floor() can be expensive since it requires dynamic changing of machine state. The functions my_rint(), sin_core(), cos_core(), and asin_core() would want to be inlined for best performance. Your compiler may do that automatically at high optimization levels (e.g. when compiling with -O3), or you could add an appropriate inlining attribute to these functions, e.g. inline or __inline depending on your toolchain.
Not knowing anything about your platform I opted for simple polynomial approximations, which are evaluated using Estrin's scheme plus Horner's scheme. See Wikipedia for a description of these evaluation schemes:
http://en.wikipedia.org/wiki/Estrin%27s_scheme ,
http://en.wikipedia.org/wiki/Horner_scheme
The approximations themselves are of the minimax type and were custom generated for this answer with the Remez algorithm:
http://en.wikipedia.org/wiki/Minimax_approximation_algorithm ,
http://en.wikipedia.org/wiki/Remez_algorithm
The identities used in the argument reduction for acos() are noted in the comments, for sin() I used a Cody/Waite-style argument reduction, as described in the following book:
W. J. Cody, W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980
The error bounds mentioned in the comments are approximate, and have not been rigorously tested or proven.
/* not quite rint(), i.e. results not properly rounded to nearest-or-even */
double my_rint (double x)
{
double t = floor (fabs(x) + 0.5);
return (x < 0.0) ? -t : t;
}
/* minimax approximation to cos on [-pi/4, pi/4] with rel. err. ~= 7.5e-13 */
double cos_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using Estrin's scheme */
return (-2.7236370439787708e-7 * x2 + 2.4799852696610628e-5) * x8 +
(-1.3888885054799695e-3 * x2 + 4.1666666636943683e-2) * x4 +
(-4.9999999999963024e-1 * x2 + 1.0000000000000000e+0);
}
/* minimax approximation to sin on [-pi/4, pi/4] with rel. err. ~= 5.5e-12 */
double sin_core (double x)
{
double x4, x2, t;
x2 = x * x;
x4 = x2 * x2;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return ((2.7181216275479732e-6 * x2 - 1.9839312269456257e-4) * x4 +
(8.3333293048425631e-3 * x2 - 1.6666666640797048e-1)) * x2 * x + x;
}
/* minimax approximation to arcsin on [0, 0.5625] with rel. err. ~= 1.5e-11 */
double asin_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return (((4.5334220547132049e-2 * x2 - 1.1226216762576600e-2) * x4 +
(2.6334281471361822e-2 * x2 + 2.0596336163223834e-2)) * x8 +
(3.0582043602875735e-2 * x2 + 4.4630538556294605e-2) * x4 +
(7.5000364034134126e-2 * x2 + 1.6666666300567365e-1)) * x2 * x + x;
}
/* relative error < 7e-12 on [-50000, 50000] */
double my_sin (double x)
{
double q, t;
int quadrant;
/* Cody-Waite style argument reduction */
q = my_rint (x * 6.3661977236758138e-1);
quadrant = (int)q;
t = x - q * 1.5707963267923333e+00;
t = t - q * 2.5633441515945189e-12;
if (quadrant & 1) {
t = cos_core(t);
} else {
t = sin_core(t);
}
return (quadrant & 2) ? -t : t;
}
/* relative error < 2e-11 on [-1, 1] */
double my_acos (double x)
{
double xa, t;
xa = fabs (x);
/* arcsin(x) = pi/2 - 2 * arcsin (sqrt ((1-x) / 2))
* arccos(x) = pi/2 - arcsin(x)
* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2))
*/
if (xa > 0.5625) {
t = 2.0 * asin_core (sqrt (0.5 * (1.0 - xa)));
} else {
t = 1.5707963267948966 - asin_core (xa);
}
/* arccos (-x) = pi - arccos(x) */
return (x < 0.0) ? (3.1415926535897932 - t) : t;
}
sin( acos(t1) + acos(t2) + ... + acos(tn) )
boils down to the calculation of
sin( acos(x) ) and cos(acos(x))=x
because
sin(a+b) = cos(a)sin(b)+sin(a)cos(b).
The first thing is
sin( acos(x) ) = sqrt(1-x*x)
Taylor series expansion for the sqrt reduces the problem to polynomial calculations.
To clarify, here's the expansion to n=2, n=3:
sin( acos(t1) + acos(t2) ) = sin(acos(t1))cos(acos(t2)) + sin(acos(t2))cos(acos(t1) = sqrt(1-t1*t1) * t2 + sqrt(1-t2*t2) * t1
cos( acos(t2) + acos(t3) ) = cos(acos(t2)) cos(acos(t3)) - sin(acos(t2))sin(acos(t3)) = t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)
sin( acos(t1) + acos(t2) + acos(t3)) =
sin(acos(t1))cos(acos(t2) + acos(t3)) + sin(acos(t2)+acos(t3) )cos(acos(t1)=
sqrt(1-t1*t1) * (t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)) + (sqrt(1-t2*t2) * t3 + sqrt(1-t3*t3) * t2 ) * t1
and so on.
The sqrt() for x in (-1,1) can be computed using
x_0 is some approximation, say, zero
x_(n+1) = 0.5 * (x_n + S/x_n) where S is the argument.
EDIT: I mean the "Babylonian method", see Wikipedia's article for details. You will need not more than 5-6 iterations to achieve 1e-10 with x in (0,1).
As Jonas Wielicki mentions in the comments, there isn't much precision trade-offs you can make.
Your best bet is to try and use the processor intrinsics for the functions (if your compiler doesn't do this already) and using some math to reduce the amount of calculations necessary.
Also very important is to keep everything in a CPU-friendly format, make sure there are few cache misses, etc.
If you are calculating large amounts of functions like acos perhaps moving to the GPU is an option for you?
You can try to create lookup tables, and use them instead of standard c++ functions, and see if you see any performance boost.
Significant gains can be made by aligning memory and streaming in the data to your kernel. Most often this dwarfs the gains that can be made by recreating the math functions. Think of how you can improve memory access to/from your kernel operator.
Memory access can be improved by using buffering techniques. This depends on your hardware platform. If you are running this on a DSP, you could DMA your data onto an L2 cache and schedule the instructions so that multiplier units are fully occupied.
If you are on general purpose CPU, most you can do is to use aligned data, feed the cache lines by prefetching. If you have nested loops, then the inner most loop should go back and forth (i.e. iterate forward and then iterate backward) so that cache lines are utilised, etc.
You could also think of ways to parallelize the computation using multiple cores. If you can use a GPU this could significantly improve performance (albeit with a lesser precision).
In addition to what others have said, here are some techniques at speed optimization:
Profile
Find out where in the code most of the time is spent.
Only optimize that area to gain the mose benefit.
Unroll Loops
The processors don't like branches or jumps or changes in the execution path. In general, the processor has to reload the instruction pipeline which uses up time that can be spent on calculations. This includes function calls.
The technique is to place more "sets" of operations in your loop and reduce the number of iterations.
Declare Variables as Register
Variables that are used frequently should be declared as register. Although many members of SO have stated compilers ignore this suggestion, I have found out otherwise. Worst case, you wasted some time typing.
Keep Intense Calculations Short & Simple
Many processors have enough room in their instruction pipelines to hold small for loops. This reduces the amount of time spent reloading the instruction pipeline.
Distribute your big calculation loop into many small ones.
Perform Work on Small Sections of Arrays & Matrices
Many processors have a data cache, which is ultra fast memory very close to the processor. The processor likes to load the data cache once from off-processor memory. More loads require time that can be spent making calculations. Search the web for "Data Oriented Design Cache".
Think in Parallel Processor Terms
Change the design of your calculations so they can be easily adaptable to use with multiple processors. Many CPUs have multiple cores that can execute instructions in parallel. Some processors have enough intelligence to automatically delegate instructions to their multiple cores.
Some compilers can optimize code for parallel processing (look up the compiler options for your compiler). Designing your code for parallel processing will make this optimization easier for the compiler.
Analyze Assembly Listing of Functions
Print out the assembly language listing of your function.
Change the design of your function to match that of the assembly language or to help the compiler generate more optimal assembly language.
If you really need more efficiency, optimize the assembly language and put in as inline assembly code or as a separate module. I generally prefer the latter.
Examples
In your situation, take first 10 terms of the Taylor expansion, calculate them separately and place into individual variables:
double term1, term2, term3, term4;
double n, n1, n2, n3, n4;
n = 1.0;
for (i = 0; i < 100; ++i)
{
n1 = n + 2;
n2 = n + 4;
n3 = n + 6;
n4 = n + 8;
term1 = 4.0/n;
term2 = 4.0/n1;
term3 = 4.0/n2;
term4 = 4.0/n3;
Then sum up all of your terms:
result = term1 - term2 + term3 - term4;
// Or try sorting by operation, if possible:
// result = term1 + term3;
// result -= term2 + term4;
n = n4 + 2;
}
Lets consider two terms first:
cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
or cos(a+b) = cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b))
Taking cos to the RHS
a+b = acos( cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b)) ) ... 1
Here cos(a) = t_1 and cos(b) = t_2
a = acos(t_1) and b = acos(t_2)
By substituting in equation (1), we get
acos(t_1) + acos(t_2) = acos(t_1*t_2 - sqrt(1 - t_1*t_1) * sqrt(1 - t_2*t_2))
Here you can see that you have combined two acos into one. So you can pair up all the acos recursively and form a binary tree. At the end, you'll be left with an expression of the form sin(acos(x)) which equals sqrt(1 - x*x).
This will improve the time complexity.
However, I'm not sure about the complexity of calculating sqrt().

Optimization of double subtraction in C++

I have the following code that I use to compute the distance between two vectors:
double dist(vector<double> & vecA, vector<double> & vecB){
double curDist = 0.0;
for (size_t i = 0; i < vecA.size(); i++){
double dif = vecA[i] - vecB[i];
curDist += dif * dif;
}
return curDist;
}
This function is a major bottleneck in my application since it relies on a lot of distance calculations, consuming more than 60% of CPU time on a typical input. Additionally, the following line:
double dif = vecA[i] - vecB[i];
is responsible for more than 77% of CPU time in this function. My question is: is it possible to somehow optimize this function?
Notes:
To profile my application I have used Intel Amplifier XE;
Reducing the number of distance computations is not a feasible solution for
me;
There are two possible issues I can think of right now:
This computation is memory bound.
There is an iteration-to-iteration dependency on curDist.
This computation is memory bound.
Your dataset is larger than your CPU cache. So in this case, no amount of optimization is going to help unless you can restructure your algorithm.
There is an iteration-to-iteration dependency on curDist.
You have a dependency on curDist. This will block vectorization by the compiler. (Also, don't always trust the profiler numbers to the line. They can be inaccurate especially after compiler optimizations.)
Normally, the compiler vectorizer can split up the curDist into multiple partial sums to and unroll/vectorize the loop. But it can't do that under strict-floating-point behavior. You can try relaxing your floating-point mode if you haven't already. Or you can split the sum and unroll it yourself.
For example, this kind of optimization is something the compiler can do with integers, but not necessarily with floating-point:
double curDist0 = 0.0;
double curDist1 = 0.0;
double curDist2 = 0.0;
double curDist3 = 0.0;
for (size_t i = 0; i < vecA.size() - 3; i += 4){
double dif0 = vecA[i + 0] - vecB[i + 0];
double dif1 = vecA[i + 1] - vecB[i + 1];
double dif2 = vecA[i + 2] - vecB[i + 2];
double dif3 = vecA[i + 3] - vecB[i + 3];
curDist0 += dif0 * dif0;
curDist1 += dif1 * dif1;
curDist2 += dif2 * dif2;
curDist3 += dif3 * dif3;
}
// Do some sort of cleanup in case (vecA.size() % 4 != 0)
double curDist = curDist0 + curDist1 + curDist2 + curDist3;
You could eliminate the call to vecA.size() for each iteration of the loop, just call it once before the loop. You could also do loop unrolling to give yourself more computation per loop iteration. What compiler are you using, and what optimization settings? Compiler will often do unrolling for you, but you could manually do it.
If it's feasible (if the range of the numbers isn't huge) you may want to explore using fixed point to store these numbers, rather than doubles.
Fixed point would turn these into int operations rather than double operations.
Another interesting thing is that assuming your profile is correct, the lookups seems to be a significant factor (otherwise the multiplication would likely be more costly than the subtractions).
I'd try using a const vector iterator rather than the random access lookup. It may help in two ways: 1 - it is constant, and 2 - the serial nature of the iterator may let the processor do better caching.
If your platform does not have (or is not using) an ALU that supports floating point math, floating point libraries, by nature, are slow and consume additional non-volatile memory. I suggest instead using 32-bit (long) or 64-bit (long long) fixed-point arithmetic. Then convert the final result to floating point at the end of the algorithm. I did this on a project a couple years ago to improve the performance of an I2T algorithm and it worked wonderfully.

Fast equivalent to sin() for DSP referenced in STK

I'm using bits of Perry Cook's Synthesis Toolkit (STK) to generate saw and square waves. STK includes this BLIT-based sawtooth oscillator:
inline STKFloat BlitSaw::tick( void ) {
StkFloat tmp, denominator = sin( phase_ );
if ( fabs(denominator) <= std::numeric_limits<StkFloat>::epsilon() )
tmp = a_;
else {
tmp = sin( m_ * phase_ );
tmp /= p_ * denominator;
}
tmp += state_ - C2_;
state_ = tmp * 0.995;
phase_ += rate_;
if ( phase_ >= PI )
phase_ -= PI;
lastFrame_[0] = tmp;
return lastFrame_[0];
}
The square wave oscillator is broadly similar. At the top, there's this comment:
// A fully optimized version of this code would replace the two sin
// calls with a pair of fast sin oscillators, for which stable fast
// two-multiply algorithms are well known.
I don't know where to start looking for these "fast two-multiply algorithms" and I'd appreciate some pointers. I could use a lookup table instead, but I'm keen to learn what these 'fast sin oscillators' are. I could also use an abbreviated Taylor series, but thats way more than two multiplies. Searching hasn't turned up anything much, although I did find this approximation:
#define AD_SIN(n) (n*(2.f- fabs(n)))
Plotting it out shows that it's not really a close approximation outside the range of -1 to 1, so I don't think I can use it when phase_ is in the range -pi to pi:
Here, Sine is the blue line and the purple line is the approximation.
Profiling my code reveals that the calls to sin() are far and away the most time-consuming calls, so I really would like to optimise this piece.
Thanks
EDIT Thanks for the detailed and varied answers. I will explore these and accept one at the weekend.
EDIT 2 Would the anonymous close voter please kindly explain their vote in the comments? Thank you.
Essentially the sinusoidal oscilator is one (or more) variables that change with each DSP step, rather than getting recalculated from scratch.
The simplest are based on the following trig identities: (where d is constant, and thus so is cos(d) and sin(d) )
sin(x+d) = sin(x) cos(d) + cos(x) sin(d)
cos(x+d) = cos(x) cos(d) - sin(x) sin(d)
However this requires two variables (one for sin and one for cos) and 4 multiplications to update. However this will still be far faster than calculating a full sine at each step.
The solution by Oli Charlesworth is based on solutions to this general equation
A_{n+1} = a A_{n} + A_{n-1}
Where looking for a solution of the form A_n = k e^(i theta n) gives an equation for theta.
e^(i theta (n+1) ) = a e^(i theta n ) + b e^(i theta (n-1) )
Which simplifies to
e^(i theta) - e^(-i theta ) = a
2 cos(theta) = a
Giving
A_{n+1} = 2 cos(theta) A_{n} + A_{n-1}
Whichever approach you use you'll either need to use one or two of these oscillators for each frequency, or use another trig identity to derive the higher or lower frequencies.
How accurate do you need this?
This function, f(x)=0.398x*(3.1076-|x|), does a reasonably good job for x between -pi and pi.
Edit
An even better approximation is f(x)=0.38981969947653056*(pi-|x|), which keeps the absolute error to 0.038158444604 or less for x between -pi and pi.
A least squares minimization will yield a slightly different function.
It's not possible to generate one-off sin calls with just two multiplies (well, not a useful approximation, at any rate). But it is possible to generate an oscillator with low complexity, i.e. where each value is calculated in terms of the preceding ones.
For instance, consider that the following difference equation will give you a sinusoid:
y[n] = 2*cos(phi)*y[n-1] - y[n-2]
(where cos(phi) is a constant)
(From the original author of the VST BLT code).
As a matter of fact, I was porting the VST BLT oscillators to C#, so I was googling for good sin oscillators. Here's what I came up with. Translation to C++ is straightforward. See the notes at the end about accuumulated round-off errors.
public class FastOscillator
{
private double b1;
private double y1, y2;
private double fScale;
public void Initialize(int sampleRate)
{
fScale = AudioMath.TwoPi / sampleRate;
}
// frequency in Hz. phase in radians.
public void Start(float frequency, double phase)
{
double w = frequency * fScale;
b1 = 2.0 * Math.Cos(w);
y1 = Math.Sin(phase - w);
y2 = Math.Sin(phase - w * 2);
}
public double Tick()
{
double y0 = b1 * y1 - y2;
y2 = y1;
y1 = y0;
return y0;
}
}
Note that this particular oscillator implementation will drift over time, so it needs to be re-initialzed periodically. In this particular implementation, the magnitude of the sin wave decays over time. The original comments in the STK code suggested a two-multiply oscillator. There are, in fact, two-multiply oscillators that are reasonably stable over time. But in retrospect, the need to keep the sin(phase), and sin(m*phase) oscillators tightly in synch probably means that they have to be resynched anyway. Round-off errors between phase and m*phase mean that even if the oscillators were stable, they would drift eventually, running a significant risk of producing large spikes in values near the zeros of the BLT functions. May as well use a one-multiply oscillator.
These particular oscillators should probably be re-initialized every 30 to 100 cycles (or so). My C# implementation is frame based (i.e. it calculates an float[] array of results in a void Tick(int count, float[] result) method. The oscillators are re-synched at the end of each Tick call. Something like this:
void Tick(int count, float[] result)
{
for (int i = 0; i < count; ++i)
{
...
result[i] = bltResult;
}
// re-initialize the oscillators to avoid accumulated drift.
this.phase = (this.phase + this.dPhase*count) % AudioMath.TwoPi;
this.sinOsc.Initialize(frequency,this.phase);
this.mSinOsc.Initialize(frequency*m,this.phase*m);
}
Probably missing from the STK code. You might want to investigate this. The original code provided to the STK did this. Gary Scavone tweaked the code a bit, and I think the optimization was lost. I do know that the STK implementations suffer from DC drift, which can be almost entirely eliminated when implemented properly.
There's a peculiar hack that prevents DC drift of the oscillators, even when sweeping the frequency of the oscillators. The trick is that the oscillators should be started with an initial phase adjustment of dPhase/2. That just so happens to start the oscillators off with zero DC drift, without having to figure out wat the correct initial state for various integrators in each of the BLT oscillators.
Strangely, if the adjustment is re-adjusted whenever the frequency of the oscillator changes, then this also prevents wild DC drift of the output when sweeping the frequency of the oscillator. Whenever the frequency changes, subtract dPhase/2 from the previous phase value, recalculate dPhase for the new frequency, and then add dPhase/2.I rather suspect this could be formally proven; but I have not been able to so. All I know is that It Just Works.
For a block implementation, the oscillators should actually be initialized as follows, instead of carrying the phase adjustment in the current this.phase value.
this.sinOsc.Initialize(frequency,phase+dPhase*0.5);
this.mSinOsc.Initialize(frequency*m,(phase+dPhase*0.5)*m);
You might want to take a look here:
http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/
There's some sample code that calculates a very good appoximation of sin/cos using only multiplies, additions and the abs() function. Quite fast too. The comments are also a good read.
It essentiall boils down to this:
float sine(float x)
{
const float B = 4/pi;
const float C = -4/(pi*pi);
const float P = 0.225;
float y = B * x + C * x * abs(x);
return P * (y * abs(y) - y) + y;
}
and works for a range of -PI to PI
If you can, you should consider memorization based techniques. Essentially store sin(x) and cos(x) values for a bunch values. To calculate sin(y), find a and b for which precomputed values exist such that a<=y<=b. Now using sin(a), sin(b), cos(a), cos(b), y-a and y-b approximately calculate sin(y).
The general idea of getting periodically sampled results from the sine or cosine function is to use a trig recursion or an initialized (barely) stable IIR filter (which can end up being pretty much the same computations). There are bunches of these in the DSP literature, of varying accuracy and stability. Choose carefully.

Can my loop be optimized any more?

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?
I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.
Any help is appreciated!
Edit: I'm running in Windows 7 64-bit with Visual Studio 2008 edition on a Core 2 Duo T5850 (2.16 GHz)
typedef double real;
struct Particle
{
Vector pos, vel, acc, jerk;
Vector oldPos, oldVel, oldAcc, oldJerk;
real mass;
};
class Vector
{
private:
real vec[3];
public:
// Operators defined here
};
real Gravity::interact(Particle *p, size_t numParticles)
{
PROFILE_FUNC();
real tau_q = 1e300;
for (size_t i = 0; i < numParticles; i++)
{
p[i].jerk = 0;
p[i].acc = 0;
}
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
Vector r = p[j].pos - p[i].pos;
Vector v = p[j].vel - p[i].vel;
real r2 = lengthsq(r);
real v2 = lengthsq(v);
// Calculate inverse of |r|^3
real r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
Vector da = r * r3i;
Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
real mij = p[i].mass + p[j].mass;
real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
real tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}
return sqrt(sqrt(tau_q));
}
Inline the calls to lengthsq().
Change pow(r2,-1.5) to 1/(r2*sqrt(r2)) to lower the cost of the computing r^1.5
Use scalars (p_i_acc, etc.) inside the innner most loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily.
4a. Try replacing the if (...) tau_q = with
tau_q=minimum(...,...)
Many compilers recognize the mininum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes.
4b. [EDIT to split 4a and 4b apart] You might consider storing tau_..q2 instead as tau_q, and comparing against r2/v2 rather than r2*r2/v2*v2. Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau..q2 at the end. To do this, collect minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop]
[EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.
Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it], and c) going multicore (using OpenMP).
This line real r3i = Constants::G * pow(r2, -1.5); is going to hurt. Any kind of sqrt lookup or platform specific help with a square root would help.
If you have simd abilities, breaking up your vector subtracts and squares into its own loop and computing them all at once will help a bit. Same for your mass/jerk calcs.
Something that comes to mind is - are you keeping enough precision with your calc? Taking things to the 4th power and 4th root really thrash your available bits through the under/overflow blender. I'd be sure that your answer is indeed your answer when complete.
Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.
Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius squared check to fight the O(n^2) behavior of your loop. If you elimiated 1/2 your particles, it would run around x4 faster.
One last thing. You could thread your inner loop to multiple processors. You'd have to make a seperate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally your mass/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)
Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.
You might consider whether you can use floats rather than doubles.
If you're using gcc then make sure you're using -O2 or possibly -O3.
You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).
Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).
If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.
Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.
If you move the mass out too, you could store things like this:
struct ParticleData
{
Vector pos, vel, acc, jerk;
};
ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...
then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g
struct ParticleData
{
// data_x[0] == pos.x, data_x[1] = vel.x, data_x[2] = acc.x, data_x[3] = jerk.x
Vector4 data_x;
// data_y[0] == pos.y, data_y[1] = vel.y, etc.
Vector4 data_y;
// data_z[0] == pos.z, data_y[1] = vel.z, etc.
Vector4 data_z;
};
where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)
Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).
My first advice would be to look at the molecular dynamics litterature, people in this field have considered a lot of optimizations in the field of particle systems. Have a look at GROMACS for example.
With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is the use of neighbour lists with buffer regions to update those lists only when needed.
All good stuff above. I've been doing similar things to a 2nd order (Leapfrog) integrator. The next two things I did after considering many of the improvements suggested above was start using SSE intrinsics to take advantage of vectorization and parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.
SSE example:
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp
Novel cache algorithm, explanation and example code:
http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/
http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp
You might also find the following deck I gave at Seattle Code Camp interesting:
http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/
Your forth order integrator is more complex and would be harder to parallelize with limited gains on a two core system but I would definitely suggest checking out SSE, I got some reasonable performance improvements here.
Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?
What's real? Can it be a float?
Apart from that you'll have to turn to MMX/SSE/assembly optimisations.
Would you benefit from the famous "fast inverse square root" algorithm?
float InvSqrt(float x)
{
union {
float f;
int i;
} tmp;
tmp.f = x;
tmp.i = 0x5f3759df - (tmp.i >> 1);
float y = tmp.f;
return y * (1.5f - 0.5f * x * y * y);
}
It returns a reasonably accurate representation of 1/r**2 (the first iteration of Newton's method with a clever initial guess). It is used widely for computer graphics and game development.
Consider also pulling your multiplication of Constants::G out of the loop. If you can change the semantic meaning of the vectors stored so that they effectively store the actual value/G you can do the gravitation constant multiplacation as needed.
Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.
Consider splitting our particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could be an efficient memset if you did this. You would then essentially have two arrays (or vectors) where part particle 'n' is stored at index 'n' of each of the arrays.
Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.
Now then, always always apply algorithm optimizations first and only when no faster algorithm is available should you go piecemeal optimization by assembly. And then, do inner loops first.
You may want to profile to see if this is really the bottleneck first.
Thing I look for is branching, they tend to be performance killers.
You can use loop unrolling.
also, remember multiple with smaller parts of the problem :-
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
is about the same as having one loop doing everything, and you can get speed ups through loop unrolling and better hitting of the cache
You could thread this to make better use of multiple cores
you have some expensive calculations that you might be able to reduce, especially if the calcs end up calculating the same thing, can use caching etc....
but really need to know where its costing you the most
You should re-use the reals and vectors that you always use. The cost of constructing a Vector or Real might be trivial.. but not if numParticles is very large, especially with your seemingly O((n^2)/2) loop.
Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;
for (size_t i = 0; i < numParticles; i++)
{
for (size_t j = i+1; j < numParticles; j++)
{
r = p[j].pos - p[i].pos;
v = p[j].vel - p[i].vel;
r2 = lengthsq(r);
v2 = lengthsq(v);
// Calculate inverse of |r|^3
r3i = Constants::G * pow(r2, -1.5);
// da = r / |r|^3
// dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
da = r * r3i;
dj = (v - r * (3 * dot(r, v) / r2)) * r3i;
// Calculate new acceleration and jerk
p[i].acc += da * p[j].mass;
p[i].jerk += dj * p[j].mass;
p[j].acc -= da * p[i].mass;
p[j].jerk -= dj * p[i].mass;
// Collision estimation
// Metric 1) tau = |r|^2 / |a(j) - a(i)|
// Metric 2) tau = |r|^4 / |v|^4
mij = p[i].mass + p[j].mass;
tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
tau_est_q2 = (r2*r2) / (v2*v2);
if (tau_est_q1 < tau_q)
tau_q = tau_est_q1;
if (tau_est_q2 < tau_q)
tau_q = tau_est_q2;
}
}
You can replace any occurrence of:
a = b/c
d = e/f
with
icf = 1/(c*f)
a = bf*icf
d = ec*icf
if you know that icf isn't going to cause anything to go out of range and if your hardware can perform 3 multiplications faster than a division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
You'll get away with fewer time steps if you use other integration schemes (eg. Runge-Kutta) but I suspect you already know that.