Optimized functions to compute the projection of a point on a line? - C++

My question is the following: how can I optimize these four independent functions:
// Computation of the coordinates of P, the projection of M onto the line (AB)
inline std::array<double, 3> P(const std::array<double, 3>& A,
                               const std::array<double, 3>& B,
                               const std::array<double, 3>& M)
{
    // The most inefficient version in the world (to be verified)
    std::array<double, 3> AB = {B[0]-A[0], B[1]-A[1], B[2]-A[2]};
    std::array<double, 3> AM = {M[0]-A[0], M[1]-A[1], M[2]-A[2]};
    double norm = std::sqrt(AB[0]*AB[0]+AB[1]*AB[1]+AB[2]*AB[2]);
    double dot = AB[0]*AM[0]+AB[1]*AM[1]+AB[2]*AM[2];
    double d1 = dot/norm;                       // length of AP
    // AP = (d1/norm) * AB, so P = A + AP
    std::array<double, 3> AP = {AB[0]*d1/norm, AB[1]*d1/norm, AB[2]*d1/norm};
    std::array<double, 3> P = {A[0]+AP[0], A[1]+AP[1], A[2]+AP[2]};
    return P;
}
// Computation of the distance d0 = |MP|
inline double d0(const std::array<double, 3>& A,
                 const std::array<double, 3>& B,
                 const std::array<double, 3>& M)
{
    // The most inefficient version in the world (to be verified)
    std::array<double, 3> AB = {B[0]-A[0], B[1]-A[1], B[2]-A[2]};
    std::array<double, 3> AM = {M[0]-A[0], M[1]-A[1], M[2]-A[2]};
    double norm = std::sqrt(AB[0]*AB[0]+AB[1]*AB[1]+AB[2]*AB[2]);
    double dot = AB[0]*AM[0]+AB[1]*AM[1]+AB[2]*AM[2];
    double d1 = dot/norm;                       // length of AP
    std::array<double, 3> AP = {AB[0]*d1/norm, AB[1]*d1/norm, AB[2]*d1/norm};
    std::array<double, 3> P = {A[0]+AP[0], A[1]+AP[1], A[2]+AP[2]};
    std::array<double, 3> MP = {P[0]-M[0], P[1]-M[1], P[2]-M[2]};
    double d0 = std::sqrt(MP[0]*MP[0]+MP[1]*MP[1]+MP[2]*MP[2]);
    return d0;
}
// Computation of the distance d1 = |AP|
inline double d1(const std::array<double, 3>& A,
                 const std::array<double, 3>& B,
                 const std::array<double, 3>& M)
{
    // The most inefficient version in the world (to be verified)
    std::array<double, 3> AB = {B[0]-A[0], B[1]-A[1], B[2]-A[2]};
    std::array<double, 3> AM = {M[0]-A[0], M[1]-A[1], M[2]-A[2]};
    double norm = std::sqrt(AB[0]*AB[0]+AB[1]*AB[1]+AB[2]*AB[2]);
    double dot = AB[0]*AM[0]+AB[1]*AM[1]+AB[2]*AM[2];
    double d1 = dot/norm;
    return d1;
}
// Computation of the distance d2 = |PB|
inline double d2(const std::array<double, 3>& A,
                 const std::array<double, 3>& B,
                 const std::array<double, 3>& M)
{
    // The most inefficient version in the world (to be verified)
    std::array<double, 3> AB = {B[0]-A[0], B[1]-A[1], B[2]-A[2]};
    std::array<double, 3> AM = {M[0]-A[0], M[1]-A[1], M[2]-A[2]};
    double norm = std::sqrt(AB[0]*AB[0]+AB[1]*AB[1]+AB[2]*AB[2]);
    double dot = AB[0]*AM[0]+AB[1]*AM[1]+AB[2]*AM[2];
    double d1 = dot/norm;
    double d2 = norm-d1;
    return d2;
}
so that each function is as fast as possible? (I will execute these functions billions of times.)

From an algorithmic point of view, you can compute the projection of a vector onto another vector without any sqrt call.
Here is the pseudocode from http://www.euclideanspace.com/maths/geometry/elements/line/projections/:
// projection of vector v1 onto v2
inline vector3 projection( const vector3& v1, const vector3& v2 ) {
    float v2_ls = v2.len_squared();
    return v2 * ( dot( v2, v1 ) / v2_ls );
}
where dot() is the dot product of two vectors and len_squared() is the dot product of a vector with itself.
NOTE: Try to precompute the inverse of v2_ls before the main loop, if possible.
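Applied to the question's std::array<double, 3> types, a minimal sqrt-free sketch could look like this (the name project_point and the layout are mine, not from the linked page):

#include <array>

// Projection of M onto the line (AB): P = A + t * AB,
// with t = (AM . AB) / (AB . AB). No square root involved.
inline std::array<double, 3> project_point(const std::array<double, 3>& A,
                                           const std::array<double, 3>& B,
                                           const std::array<double, 3>& M)
{
    std::array<double, 3> AB = {B[0]-A[0], B[1]-A[1], B[2]-A[2]};
    std::array<double, 3> AM = {M[0]-A[0], M[1]-A[1], M[2]-A[2]};
    double ab2 = AB[0]*AB[0] + AB[1]*AB[1] + AB[2]*AB[2];   // len_squared
    double t   = (AM[0]*AB[0] + AM[1]*AB[1] + AM[2]*AB[2]) / ab2;
    std::array<double, 3> P = {A[0]+t*AB[0], A[1]+t*AB[1], A[2]+t*AB[2]};
    return P;
}

If A and B are fixed across calls, 1.0/ab2 is the quantity worth precomputing.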

It is probably better to compute all the requested quantities in a single go.
Let P = A + t . AB be the vector equation giving the position of P. Express that MP and AB are orthogonal: MP . AB = 0 = (MA + t . AB) . AB, which yields t = - (MA . AB) / AB^2, and then P.
t is the ratio AP / AB, hence d1 = t . |AB|. Similarly, d2 = (1 - t) . |AB|. d0 is obtained from Pythagoras, d0^2 = MA^2 - d1^2, or by direct computation of |MP|.
Accounting: compute MA (3 add), AB (3 add), AB^2 (2 add, 3 mul), MA.AB (2 add, 3 mul), t (1 div), P (3 add, 3 mul), |AB| (1 sqrt), d1 (1 mul), d2 (1 add, 1 mul), MA^2 (2 add, 3 mul), d0 (1 add, 1 mul, 1 sqrt).
Total 17 add, 15 mul, 1 div, 2 sqrt.
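As a sketch of that accounting (the struct and function names are mine; the last sqrt may want a guard against a tiny negative argument from rounding):

#include <array>
#include <cmath>

struct LineQuantities { std::array<double, 3> P; double d0, d1, d2; };

inline LineQuantities all_quantities(const std::array<double, 3>& A,
                                     const std::array<double, 3>& B,
                                     const std::array<double, 3>& M)
{
    std::array<double, 3> MA = {A[0]-M[0], A[1]-M[1], A[2]-M[2]};  // 3 add
    std::array<double, 3> AB = {B[0]-A[0], B[1]-A[1], B[2]-A[2]};  // 3 add
    double ab2 = AB[0]*AB[0] + AB[1]*AB[1] + AB[2]*AB[2];          // 2 add, 3 mul
    double madotab = MA[0]*AB[0] + MA[1]*AB[1] + MA[2]*AB[2];      // 2 add, 3 mul
    double t = -madotab / ab2;                                     // 1 div
    std::array<double, 3> P =
        {A[0]+t*AB[0], A[1]+t*AB[1], A[2]+t*AB[2]};                // 3 add, 3 mul
    double lab = std::sqrt(ab2);                                   // 1 sqrt
    double d1  = t * lab;                                          // 1 mul
    double d2  = (1.0 - t) * lab;                                  // 1 add, 1 mul
    double ma2 = MA[0]*MA[0] + MA[1]*MA[1] + MA[2]*MA[2];          // 2 add, 3 mul
    double d0  = std::sqrt(ma2 - d1*d1);                           // 1 add, 1 mul, 1 sqrt
    return {P, d0, d1, d2};
}

Note the sign: t = -(MA . AB) / AB^2 equals (AM . AB) / AB^2, since AM = -MA.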

If you want portable code, so that no processor-specific features are used, I'd suggest the following:
1) As I mentioned in my comment above, create a 3D vector class; it will just make the code a lot easier to write (optimise development time).
2) Create an intersection class that uses lazy evaluation to get P, d0 and d1, like this (a short usage sketch follows at the end of this answer):
#include <array>
#include <cmath>

class Intersection
{
public:
    using Point = std::array<double, 3>;
    Intersection(const Point& A, const Point& B, const Point& M)
        : A(A), B(B), M(M), constants_calculated(false) {}
    Point  GetP()  { CalculateConstants(); return P; }
    double GetD0() { CalculateConstants(); return D0; }
    double GetD1() { CalculateConstants(); return D1; }
private:
    void CalculateConstants()
    {
        if (constants_calculated) return;
        // calculate and store the common expressions required for P, d0 and d1
        const Point AB = {B[0]-A[0], B[1]-A[1], B[2]-A[2]};
        const Point AM = {M[0]-A[0], M[1]-A[1], M[2]-A[2]};
        const double ab2 = AB[0]*AB[0] + AB[1]*AB[1] + AB[2]*AB[2];
        const double t = (AM[0]*AB[0] + AM[1]*AB[1] + AM[2]*AB[2]) / ab2;
        P  = {A[0]+t*AB[0], A[1]+t*AB[1], A[2]+t*AB[2]};
        D1 = t * std::sqrt(ab2);
        const Point MP = {P[0]-M[0], P[1]-M[1], P[2]-M[2]};
        D0 = std::sqrt(MP[0]*MP[0] + MP[1]*MP[1] + MP[2]*MP[2]);
        constants_calculated = true;
    }
    Point A, B, M, P;
    double D0, D1;
    bool constants_calculated;
};
3) Don't call it a billion times. Not doing something is infinitely quicker. Why does it need to be called so often? Is there a way to do the same thing with fewer calls to find P, d0 and d1?
If you can use processor-specific features, then you could look into doing things like using SIMD, but that may require dropping the precision from double to float.
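A possible usage sketch for the Intersection class from point 2, given three points A, B, M (construction just stores the inputs; the shared work happens once, on the first getter call):

Intersection isect(A, B, M);   // cheap: stores A, B, M only
auto   p  = isect.GetP();      // first query computes and caches P, D0, D1
double d0 = isect.GetD0();     // served from the cache
double d1 = isect.GetD1();     // served from the cache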

The following is a C++ implementation of a 2D point-to-line projection calculation:
#include <iostream>
#include <cmath>
using namespace std;

int main() {
    // the point
    double x0 = 1.0, y0 = 1.0;
    // the line equation: A*x + B*y + C = 0
    double A = 1.0, B = 1.0, C = 0.0;
    // signed offset of the point from the line, scaled by A^2 + B^2;
    // keeping the sign projects correctly from either side of the line
    double s = (A * x0 + B * y0 + C) / (A * A + B * B);
    // calc point to line distance
    double dist = fabs(A * x0 + B * y0 + C) / sqrt(A * A + B * B);
    // calc project point coord
    double x1 = x0 - s * A;
    double y1 = y0 - s * B;
    // result
    cout << "distance: " << dist << endl;
    cout << "project point:(" << x1 << ", " << y1 << ")" << endl;
    return 0;
}

Related

Efficient floating point scaling in C++

I'm working on my fast (and accurate) sin implementation in C++, and I have a problem regarding the efficient angle scaling into the +- pi/2 range.
My sin function for +-pi/2 using Taylor series is the following
(Note: FLOAT is a macro expanded to float or double just for the benchmark)
/**
 * Sin for 'small' angles, accurate on [-pi/2, pi/2], fairly accurate on [-pi, pi]
 */
// To switch between float and double
#define FLOAT float

FLOAT
my_sin_small(FLOAT x)
{
    constexpr FLOAT C1 = 1. / (7. * 6. * 5. * 4. * 3. * 2.);
    constexpr FLOAT C2 = -1. / (5. * 4. * 3. * 2.);
    constexpr FLOAT C3 = 1. / (3. * 2.);
    constexpr FLOAT C4 = -1.;
    // Correction for sin(pi/2) = 1, due to the ignored Taylor terms
    constexpr FLOAT corr = -1. / 0.9998431013994987;
    const FLOAT x2 = x * x;
    return corr * x * (x2 * (x2 * (x2 * C1 + C2) + C3) + C4);
}
So far so good... The problem comes when I try to scale an arbitrary angle into the +-pi/2 range. My current solution is:
FLOAT
my_sin(FLOAT x)
{
    constexpr FLOAT pi = 3.141592653589793238462;
    constexpr FLOAT rpi = 1 / pi;
    // convert to +-pi/2 range
    int n = std::nearbyint(x * rpi);
    FLOAT xbar = (n * pi - x) * (2 * (n & 1) - 1);
    // (2 * (n % 2) - 1) is a sign correction (see below)
    return my_sin_small(xbar);
}
I made a benchmark, and I'm losing a lot of performance in the +-pi/2 scaling.
Tricking with int(angle/pi + 0.5) is a no-go since it is limited to int precision; it also requires +/- branching, and I try to avoid branches.
What should I try to improve the performance of this scaling? I'm out of ideas.
[Plots omitted: benchmark results for float and double, the sign correction table for xbar in my_sin(), and the algorithm's accuracy compared to Python's sin() function. In the benchmark the angle could be out of the validity range for my_sin_small, but for the bench that doesn't matter.]
Candidate improvements
Convert the radians x to rotations by dividing by 2*pi.
Retain only the fraction so we have an angle in (-1.0 ... 1.0). This simplifies the OP's modulo step to a simple "drop the whole number" step instead. Going forward with different angle units simply involves a coefficient set change. No need to scale back to radians.
For positive values, subtract 0.5 so we have (-0.5 ... 0.5) and then flip the sign. This centers the possible values about 0.0 and makes for better convergence of the approximating polynomial as compared to the math sine function. For negative values - see below.
Call my_sin_small1() that uses this (-0.5 ... 0.5) rotations range rather than [-pi ... +pi] radians.
In my_sin_small1(), fold constants together to drop the corr * step.
Rather than use the truncated Taylor series, use a more optimal coefficient set. IMO, this will provide better answers, especially near +/-pi.
Notes: No int to/from float code. With more analysis, it is possible to get a better set of coefficients that brings my_sin(+/-pi) closer to 0.0. This is just a quick set of code to demonstrate fewer FP steps and good potential results.
C-like code for the OP to port to C++:
FLOAT my_sin_small1(FLOAT x) {
    static const FLOAT A1 = -5.64744881E+01;
    static const FLOAT A2 = +7.81017968E+01;
    static const FLOAT A3 = -4.11145353E+01;
    static const FLOAT A4 = +6.27923581E+00;
    const FLOAT x2 = x * x;
    return x * (x2 * (x2 * (x2 * A1 + A2) + A3) + A4);
}

FLOAT my_sin1(FLOAT x) {
    static const FLOAT pi = 3.141592653589793238462;
    static const FLOAT pi2i = 1 / (pi * 2);
    x *= pi2i;
    FLOAT xfraction = 0.5f - (x - truncf(x));
    return my_sin_small1(xfraction);
}
For negative values, use -my_sin1(-x) or similar code to flip the sign, or add 0.5 in the "minus 0.5" step above.
Test
#include <math.h>
#include <stdio.h>

int main(void) {
    for (int d = 0; d <= 360; d += 20) {
        FLOAT x = d / 180.0 * M_PI;
        FLOAT y = my_sin1(x);
        printf("%12.6f %11.8f %11.8f\n", x, sin(x), y);
    }
}
Output
0.000000 0.00000000 -0.00022483
0.349066 0.34202013 0.34221691
0.698132 0.64278759 0.64255589
1.047198 0.86602542 0.86590189
1.396263 0.98480775 0.98496443
1.745329 0.98480775 0.98501128
2.094395 0.86602537 0.86603642
2.443461 0.64278762 0.64260530
2.792527 0.34202022 0.34183803
3.141593 -0.00000009 0.00000000
3.490659 -0.34202016 -0.34183764
3.839724 -0.64278757 -0.64260519
4.188790 -0.86602546 -0.86603653
4.537856 -0.98480776 -0.98501128
4.886922 -0.98480776 -0.98496443
5.235988 -0.86602545 -0.86590189
5.585053 -0.64278773 -0.64255613
5.934119 -0.34202036 -0.34221727
6.283185 0.00000017 -0.00022483
Alternate code below makes for better results near 0.0, yet might cost a tad more time. OP seems more inclined to speed.
FLOAT xfraction = 0.5f - (x - truncf(x));
// vs.
FLOAT xfraction = x - truncf(x);
if (xfraction >= 0.5f) xfraction -= 1.0f;
[Edit]
Below is a better set with about 10% reduced error.
-56.0833765f
77.92947047f
-41.0936875f
6.278635918f
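Slotted into the polynomial above (a sketch; the name my_sin_small2 is mine, the coefficients are the improved set just quoted):

FLOAT my_sin_small2(FLOAT x) {
    // coefficient set with about 10% reduced error
    static const FLOAT A1 = -56.0833765f;
    static const FLOAT A2 = +77.92947047f;
    static const FLOAT A3 = -41.0936875f;
    static const FLOAT A4 = +6.278635918f;
    const FLOAT x2 = x * x;
    return x * (x2 * (x2 * (x2 * A1 + A2) + A3) + A4);
}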
Yet another approach:
Spend more time (code) to reduce the range to ±pi/4 (±45 degrees); then it is possible to use only 3 or 2 terms of a polynomial that is like the usual Taylor series.
float sin_quick_small(float x) {
    const float x2 = x * x;
#if 0
    // max error about 7e-7
    static const FLOAT A2 = +0.00811656036940792f;
    static const FLOAT A3 = -0.166597759850666f;
    static const FLOAT A4 = +0.999994132743861f;
    return x * (x2 * (x2 * A2 + A3) + A4);
#else
    // max error about 0.00016
    static const FLOAT A3 = -0.160343346851626f;
    static const FLOAT A4 = +0.999031566686144f;
    return x * (x2 * A3 + A4);
#endif
}

float cos_quick_small(float x) {
    return cosf(x); // TBD code.
}
float sin_quick(float x) {
    if (x < 0.0) {
        return -sin_quick(-x);
    }
    int quo;
    float x90 = remquof(fabsf(x), 3.141592653589793238462f / 2, &quo);
    switch (quo % 4) {
        case 0:
            return sin_quick_small(x90);
        case 1:
            return cos_quick_small(x90);
        case 2:
            return sin_quick_small(-x90);
        case 3:
            return -cos_quick_small(x90);
    }
    return 0.0;
}
int main() {
    float max_x = 0.0;
    float max_error = 0.0;
    for (int d = -45; d <= 45; d += 1) {
        FLOAT x = d / 180.0 * M_PI;
        FLOAT y = sin_quick(x);
        double err = fabs(y - sin(x));
        if (err > max_error) {
            max_x = x;
            max_error = err;
        }
        printf("%12.6f %11.8f %11.8f err:%11.8f\n", x, sin(x), y, err);
    }
    printf("x:%.6f err:%.6f\n", max_x, max_error);
    return 0;
}

Dot function in Unreal C++

This is the code in Unreal C++
float GetT( float t, float alpha, const FVector& p0, const FVector& p1 )
{
    auto d = p1 - p0;
    float a = d | d; // Dot product
    float b = FMath::Pow( a, alpha * .5f );
    return (b + t);
}
Does the line "float a = d | d; // Dot product" mean the dot product of the FVector d with itself?
https://en.wikipedia.org/wiki/Centripetal_Catmull%E2%80%93Rom_spline
Look for the documentation of FVector. Search for "operators". Look for |. Find:
float operator|( const FVector& V )
Calculate the dot product between this and another vector.
Yes. d | d calculates the dot product of the vector with itself.
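For readability, the same quantity is also available through named FVector helpers; a small sketch, given an FVector d as in GetT above (standard UE API, shown here only for comparison):

// All three compute the squared length of d:
float a1 = d | d;                      // operator|, the dot product
float a2 = FVector::DotProduct(d, d);  // static named equivalent
float a3 = d.SizeSquared();            // dot of a vector with itself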

Eigen: Why is Map slower than Vector3d for this template expression?

I have a cloud of points in a std::vector<double> in an x, y, z pattern, and a std::vector<int> of indices where each triplet of consecutive integers is the connectivity of a face. Basically a simple triangular mesh data structure.
I have to compute the areas of all the faces and I am benchmarking several methods:
I can wrap chunks of data in an Eigen::Map<const Eigen::Vector3d> like this:
static void face_areas_eigenmap(const std::vector<double>& V,
                                const std::vector<int>& F,
                                std::vector<double>& FA) {
  // Number of faces is size / 3.
  for (auto f = 0; f < F.size() / 3; ++f) {
    // Get vertex indices of face f.
    auto v0 = F[f * 3];
    auto v1 = F[f * 3 + 1];
    auto v2 = F[f * 3 + 2];
    // View memory at each vertex position as a vector.
    Eigen::Map<const Eigen::Vector3d> x0{&V[v0 * 3]};
    Eigen::Map<const Eigen::Vector3d> x1{&V[v1 * 3]};
    Eigen::Map<const Eigen::Vector3d> x2{&V[v2 * 3]};
    // Compute and store face area.
    FA[f] = 0.5 * (x1 - x0).cross(x2 - x0).norm();
  }
}
Or I can choose to create Eigen::Vector3d like this:
static void face_areas_eigenvec(const std::vector<double>& V,
                                const std::vector<int>& F,
                                std::vector<double>& FA) {
  for (auto f = 0; f < F.size() / 3; ++f) {
    auto v0 = F[f * 3];
    auto v1 = F[f * 3 + 1];
    auto v2 = F[f * 3 + 2];
    // This is the only change, swap Map for Vector3d.
    Eigen::Vector3d x0{&V[v0 * 3]};
    Eigen::Vector3d x1{&V[v1 * 3]};
    Eigen::Vector3d x2{&V[v2 * 3]};
    FA[f] = 0.5 * (x1 - x0).cross(x2 - x0).norm();
  }
}
Finally I am also considering the hardcoded version with the explicit cross product and norm:
static void face_areas_ptr(const std::vector<double>& V,
                           const std::vector<int>& F, std::vector<double>& FA) {
  for (auto f = 0; f < F.size() / 3; ++f) {
    const auto* x0 = &V[F[f * 3] * 3];
    const auto* x1 = &V[F[f * 3 + 1] * 3];
    const auto* x2 = &V[F[f * 3 + 2] * 3];
    std::array<double, 3> s0{x1[0] - x0[0], x1[1] - x0[1], x1[2] - x0[2]};
    std::array<double, 3> s1{x2[0] - x0[0], x2[1] - x0[1], x2[2] - x0[2]};
    std::array<double, 3> c{s0[1] * s1[2] - s0[2] * s1[1],
                            s0[2] * s1[0] - s0[0] * s1[2],
                            s0[0] * s1[1] - s0[1] * s1[0]};
    FA[f] = 0.5 * std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
  }
}
I have benchmarked these methods, and the version using Eigen::Map is always the slowest despite doing the exact same thing as the one using Eigen::Vector3d. I was expecting no difference in performance, as a map is basically a pointer.
-----------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
-----------------------------------------------------------------
BM_face_areas_eigenvec    59757936 ns     59758018 ns           11
BM_face_areas_ptr         58305018 ns     58304436 ns           11
BM_face_areas_eigenmap    62356850 ns     62354710 ns           10
I have tried replacing the Eigen template expression in the map version with the same code as in the pointer version:
std::array<double, 3> s0{x1[0] - x0[0], x1[1] - x0[1], x1[2] - x0[2]};
std::array<double, 3> s1{x2[0] - x0[0], x2[1] - x0[1], x2[2] - x0[2]};
std::array<double, 3> c{s0[1] * s1[2] - s0[2] * s1[1],
                        s0[2] * s1[0] - s0[0] * s1[2],
                        s0[0] * s1[1] - s0[1] * s1[0]};
FA[f] = 0.5 * std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
And magically the timings are comparable:
-----------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
-----------------------------------------------------------------
BM_face_areas_array       58967864 ns     58967891 ns           11
BM_face_areas_ptr         60034545 ns     60034682 ns           11
BM_face_areas_eigenmap    60382482 ns     60382027 ns           11
Is there something wrong with Eigen::Map in Eigen expressions to be aware of?
Looking at the compiler output it seems like the second version makes the compiler emit fewer memory loads by aggregating some of them into vector loads.
https://godbolt.org/z/qs38P41eh
Eigen's code for cross does not contain any explicit vectorization. It depends on the compiler doing a good job with it. And because you call cross on an expression (the subtractions), the compiler gives up a little too soon. Basically, it is the compiler's fault for not finding the same optimization.
Your third code works the same as the second because the compiler recognizes the subtraction (creation of s0 and s1) as something it can do vectorized, resulting in equivalent code. You can achieve the same with Eigen if you do it like this:
Eigen::Map<const Eigen::Vector3d> x0{&V[v0 * 3]};
Eigen::Map<const Eigen::Vector3d> x1{&V[v1 * 3]};
Eigen::Map<const Eigen::Vector3d> x2{&V[v2 * 3]};
// Materializing the differences into concrete vectors lets the
// compiler vectorize the subtractions before the cross product.
Eigen::Vector3d s0 = x1 - x0;
Eigen::Vector3d s1 = x2 - x0;
// Compute and store face area.
FA[f] = 0.5 * s0.cross(s1).norm();

"Point-Segment" distance: shouldn't this code use the norm instead of the norm squared?

I am using a piece of code I have found on the internet (here) to compute the distance between a point and a segment. Here is the code:
float
dist_Point_to_Segment( Point P, Segment S)
{
    Vector v = S.P1 - S.P0;
    Vector w = P - S.P0;

    double c1 = dot(w,v);
    if ( c1 <= 0 )
        return d(P, S.P0);

    double c2 = dot(v,v);
    if ( c2 <= c1 )
        return d(P, S.P1);

    double b = c1 / c2;
    Point Pb = S.P0 + b * v;
    return d(P, Pb);
}
When computing double b = c1 / c2;, c2 is dot(v, v) (so, the norm of v squared). Shouldn't we use norm(v)? Isn't that the proper definition of the projection of a vector onto another one?
Thanks.
Actually the definition is with norm(v) squared. So dot(v, v) is correct.
Here's a nice and short explanation:
http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/dotprod/dotprod.html
If v is normalized, the length of the projection is w.v, and the projected vector is (w.v) v.
As v appears twice, the formula for an unnormalized vector is
(w.(v/|v|)) v/|v| = (w.v/|v|²) v
This spares a square root.
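A quick numerical check of that identity (a hedged sketch; the vectors and names are mine):

#include <array>
#include <cmath>
#include <cstdio>

int main() {
    const std::array<double, 3> v = {3.0, 4.0, 0.0};
    const std::array<double, 3> w = {1.0, 2.0, 5.0};
    const double wv  = w[0]*v[0] + w[1]*v[1] + w[2]*v[2];  // dot(w, v)
    const double v2  = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];  // dot(v, v) = |v|^2
    const double len = std::sqrt(v2);                      // only for comparison
    for (int i = 0; i < 3; ++i) {
        const double no_sqrt   = (wv / v2) * v[i];          // (w.v/|v|^2) v
        const double with_sqrt = (wv / len) * (v[i] / len); // (w.(v/|v|)) v/|v|
        std::printf("%f %f\n", no_sqrt, with_sqrt);         // identical columns
    }
    return 0;
}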

Generating a C++ function from a list of arguments

I am writing a function f to be used in a Runge-Kutta integrator.
output RungeKutta(function f, initial conditions IC, etc.)
Since the function will be called many times, I am looking for a way to generate the function f at compile time.
In this case, function f depends on a fixed list of parameters vector p, where p is sparse and is fixed before the code is compiled. To be concrete,
// pseudocode: f returns the dot product of x with the fixed sparse vector p
double f(vector<double> x) {
    return x dot p;
}
Since p is sparse, taking the dot product in f is not the most efficient. Hard-coding x dot p seems to be the way to go, but p can be very long (1000).
What are my options?
Is writing another program (taking p as input) to generate a .cpp file my only option?
Thanks for the comments. Here is a more concrete example for the differential equation.
dy/dx = f_p(x)
One example for f_p(x):
p = [0, 1, 0]; x = [x1, x2, x3]
double f_p(vector<double> x) {
    return x[1]; // i.e. x2 in the notation above; this is what I meant by hard-coding
}
instead of:
double f(vector<double> p, vector<double> x) {
    double r = 0;
    for (size_t i = 0; i < p.size(); i++) {
        r += p[i] * x[i];
    }
    return r;
}
The key problem you are trying to solve is that a "leaf" function in your calculation, which will be called many times, will also most often do no work, given the problem domain. The hope is that the redundant work - namely, multiplying a value with an element of an array known at compile time to be zero - can be collapsed as part of a compile-time step.
C++ has language facilities to deal with this, namely template metaprogramming. C++ templates are very powerful (i.e. Turing complete) and allow for things like recursive calculations based on compile-time constants.
Below is an example of how to implement your example using templates and template specialization (you can also find a runnable example I've created here: http://ideone.com/BDtBt7). The basic idea behind the code is to generate a type with a static function that returns the resulting dot product of an input vector of values and a compile-time constant array. The static function recursively calls instances of itself, passing a lower index value as it moves through the input/constant arrays of elements. It is also templated on whether the value being evaluated in the compile-time constant array p is zero. If it is, we can skip calculating that term and move on to the next value in the recursion. Lastly, there is a base case that stops the recursion once we have reached the first element in the array.
#include <array>
#include <iostream>
#include <vector>

constexpr std::array<double, 5> p = { 1.0, 0.0, 3.0, 5.0, 0.0 };

template<size_t index, bool isZero>
struct DotProductCalculator
{
    static double Calculate(const std::vector<double>& xArg)
    {
        return (xArg[index] * p[index])
            + DotProductCalculator<index - 1, p[index - 1] == 0.0>::Calculate(xArg);
    }
};

template<>
struct DotProductCalculator<0, true>
{
    static double Calculate(const std::vector<double>& xArg)
    {
        return 0.0;
    }
};

template<>
struct DotProductCalculator<0, false>
{
    static double Calculate(const std::vector<double>& xArg)
    {
        return xArg[0] * p[0];
    }
};

template<size_t index>
struct DotProductCalculator<index, true>
{
    static double Calculate(const std::vector<double>& xArg)
    {
        return 0.0 + DotProductCalculator<index - 1, p[index - 1] == 0.0>::Calculate(xArg);
    }
};

template<typename ArrayType>
double f_p_driver(const std::vector<double>& xArg, const ArrayType& pAsArgument)
{
    return DotProductCalculator<std::tuple_size<ArrayType>::value - 1,
                                p[std::tuple_size<ArrayType>::value - 1] == 0.0>::Calculate(xArg);
}

int main()
{
    std::vector<double> x = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    double result = f_p_driver(x, p);
    std::cout << "Result: " << result;
    return 0;
}
You say in the comments that P really is a row or column of a matrix, and that the matrix is sparse. I'm not familiar with the specific physical problem you are solving, but often, sparse matrices have a fixed diagonal "banding" structure of some kind, e.g.:
| a1 b1 0 0 0 0 0 d1 |
| c1 a2 b2 0 0 0 0 0 |
| 0 c2 a3 b3 0 0 0 0 |
| 0 0 c3 a4 b4 0 0 0 |
| 0 0 0 c4 a5 b5 0 0 |
| 0 0 0 0 c5 a6 b6 0 |
| 0 0 0 0 0 c6 a7 b7 |
| e1 0 0 0 0 0 c7 a8 |
The most efficient way to store such matrices tends to be to store the diagonals as arrays/vectors, so:
A = [a1, a2, a3, a4, a5, a6, a7, a8]
B = [b1, b2, b3, b4, b5, b6, b7]
C = [c1, c2, c3, c4, c5, c6, c7]
D = [d1]
E = [e1]
Multiplying a row-vector X = [x1, x2, x3, x4, x5, x6, x7, x8] by the above matrix thus becomes:
Y = X . M
Y[0] = X[0] * A[0] + X[1] * C[0] + X[7] * E[0]
Y[1] = X[0] * B[0] + X[1] * A[1] + X[2] * C[1]
etc.
or more generally:
Y[i] = X[i-7] * D[i] + X[i-1] * B[i] + X[i] * A[i] + X[i+1] * C[i] + X[i+7] * E[i]
Where out-of-range array accesses (< 0 or >= 8) should be treated as evaluating to 0. To avoid having to test for out-of-bounds everywhere, you can actually store each diagonal and the vector itself in oversized arrays whose leading and trailing elements are filled with zeroes.
Note that this will also be highly cache efficient, as all array accesses are linear.
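As an illustrative sketch of that scheme (hedged: the function band_mul and the bounds-checked indexing are mine, and the layout follows the concrete 8x8 example rather than a general API):

#include <array>

// Y = X . M for the 8x8 example above, with M stored as diagonals:
// A (main), B (super), C (sub), and the two corner elements d1 and e1.
std::array<double, 8> band_mul(const std::array<double, 8>& X,
                               const std::array<double, 8>& A,
                               const std::array<double, 7>& B,
                               const std::array<double, 7>& C,
                               double d1, double e1)
{
    std::array<double, 8> Y{};
    for (int j = 0; j < 8; ++j) {
        double y = X[j] * A[j];                  // main diagonal a_{j+1}
        if (j > 0) y += X[j - 1] * B[j - 1];     // superdiagonal b_j
        if (j < 7) y += X[j + 1] * C[j];         // subdiagonal c_{j+1}
        Y[j] = y;
    }
    Y[0] += X[7] * e1;                           // corner at (row 7, col 0)
    Y[7] += X[0] * d1;                           // corner at (row 0, col 7)
    return Y;
}

With the zero-padding trick from the previous paragraph, the two if tests disappear and the loop body becomes branch-free.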
With the given constraints I would create a custom function object which stores the matrix p and computes the operation in its function call operator. I would implement two versions of the function: one which preprocesses the matrix upon construction to "know" where the non-zero elements are, and one which just does the operations as stated, accepting that many of the computations simply yield 0. The quoted figure of 10% non-zero elements is likely too dense for the complication of taking advantage of the sparsity to pay off.
Ignoring that p is a matrix and using it as a vector, the version without preprocessing would be something like this:
#include <numeric> // for std::inner_product

class dotProduct {
    std::vector<double> p;
public:
    dotProduct(std::vector<double> const& p): p(p) {}
    double operator()(std::vector<double> const& x) const {
        return std::inner_product(p.begin(), p.end(), x.begin(), 0.0);
    }
};
// ...
... RungeKutta(dotProduct(p), initial conditions IC, etc.);
When using C++11, a lambda function could be used instead:
... RungeKutta([=](std::vector<double> const& x) {
        return std::inner_product(p.begin(), p.end(), x.begin(), 0.0);
    }, initial conditions IC, etc.);
For the preprocessing version you'd store a std::vector<std::pair<double, std::size_t>> indicating which indices actually need to be multiplied:
class sparseDotProduct {
    typedef std::vector<std::pair<double, std::size_t>> Vector;
    Vector p;
public:
    sparseDotProduct(std::vector<double> const& op) {
        for (std::size_t i(0), s(op.size()); i != s; ++i) {
            if (op[i]) {
                p.push_back(std::make_pair(op[i], i));
            }
        }
    }
    double operator()(std::vector<double> const& x) const {
        double result(0);
        for (Vector::const_iterator it(p.begin()), end(p.end()); it != end; ++it) {
            result += it->first * x[it->second];
        }
        return result;
    }
};
The use of this function object is just the same although it may be reasonable to keep this object around if p doesn't change.
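For instance, a hedged usage sketch (RungeKutta as in the question's pseudo-signature):

sparseDotProduct f(p);   // the preprocessing pass over p happens once, here
... RungeKutta(f, initial conditions IC, etc.);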
I would personally expect the non-sparse version to actually outperform the sparse version if there are 10% non-zero values. However, with these two versions around, it should be relatively simple to measure the performance of the different approaches. I wouldn't expect custom-generated code to be substantially better, although it could improve on the computation. If so, it may work to use metaprogramming techniques to create the code, but I doubt that this would be too practical.