Double variable as loop counter - c++

I often have to code explicit schemes which means that I have to look at the evolution of a function by incrementing the time variable t <- t+ dt. It is therefore only natural to have my loops increment on dt:
int N = 7;
double t=0., T = 1., dt=T/N; // or I could have dt=0.03 for example
for(; t<T; t+= dt){
if(T - t < dt){
dt = T-t;
}
//some functions that use dt, t, T etc
}
The rationale behind this is that I'm incrementing t by a constant dt at each step, except at the last iteration, where if my current time t is such that T- dt < t < T then I modify my time increment by dt <- T-t.
What are the possible pitfalls of such a procedure or ways I could improve it? I do realise that I might get a very small time increment.
Are there any floating-point problems that might appear (should I stick to incrementing on integers)?
In terms of optimisation, I assume that this technique is not costly at all, since a basic branch prediction would almost always skip the if block.
EDIT
I realise my question wasn't really good. Usually the dt is given by a CFL condition i.e. it is given so that it is small enough compared to some other parameters.
So from a logical point of view, dt is given first, and afterwards we can define an integer N=floor(T/dt), loop with integers up to N, then deal with the leftover time interval [N*dt, T].
The code would be:
double dt = //given by some function;
double t=0., T = 1.;
for(; t<T; t+= dt){
if(T - t < dt){
dt = T-t;
}
//some functions that use dt, t, T etc
}

First of all, the compensation if (T - t < dt) is not needed, because its only purpose appears to be to set the last value to t == T, which won't be processed due to the inequality ...;t < T;... in the for loop condition.
That being said, the finite difference method doesn't work that well with floats unless N is a power of two. If e.g. one wishes to evaluate a function at steps of 0.1f, one will most likely miss a few integral points.
Branch prediction may skip the condition evaluation, but there may also be a penalty/latency in mixing floating point operations with flow control operations.
Due to the accumulating rounding errors, it's possible that the iteration count can't be easily determined by the optimizer, disallowing some optimizations (loop unrolling or even vectorizing).
The inaccuracy can be mitigated simply by linear interpolation: t = c * dt;, but not perfectly, because not for all cases (dbl / N) * N == dbl. In practice the error should be in the epsilon magnitude. To get exact ending value, one has to calculate t = (range * N) / N; this time making sure range * N doesn't drop the least significant bits.
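A minimal sketch of the effect, assuming IEEE-754 double arithmetic; count_accumulated and time_at are hypothetical helper names, not from the question. Accumulating t += dt with dt = 0.1 leaves t just below 1 after ten additions, so the loop body runs an eleventh time, while computing the time from an integer counter gives exactly 1.0 for step 10:

```cpp
#include <cassert>

// Number of iterations of the original pattern: for (t = 0; t < T; t += dt).
int count_accumulated(double T, double dt)
{
    int count = 0;
    for (double t = 0.0; t < T; t += dt)
        ++count;
    return count;
}

// Time of step k computed from the integer counter, as in t = c * dt.
// Only one rounding happens here, instead of one per accumulated step.
double time_at(int k, double dt)
{
    return k * dt;
}
```

With dt = 0.25, which is exactly representable in binary, count_accumulated(1.0, 0.25) gives the expected 4 iterations; with dt = 0.1 it gives 11.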

With the new information that dt must be set to a fixed, predetermined value
(at least for all but the last step), here is my recommendation:
double T0 = 0.0;
double T = 1.0;
int N = floor((T - T0)/dt);
double t = T0;
for (int step_number = 0; step_number < N; ++step_number, t += dt)
{
t = T0 + step_number * dt;
do_one_step(t, T, dt);
}
if (t < T)
{
do_one_step(t, T, T - t);
}
The function do_one_step performs the necessary calculations using
t, T, and dt for each iteration.
The data that must be updated by the function can either be made member
variables of the class or can be included in the function parameter
list as non-const references.
By the way, I have the last call to the function outside the loop not in
order to save the possible cost of the branch condition, but because
I find the code to be better organized and easier to understand that way.
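As a rough sketch of the same idea in reusable form, the step schedule can be computed up front. The function name make_steps and the (t, step length) pair layout are my own assumptions for illustration; do_one_step would then be called once per entry.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper: returns (start time, step length) pairs covering [T0, T].
// All steps have length dt except possibly a shorter last one.
std::vector<std::pair<double, double>> make_steps(double T0, double T, double dt)
{
    assert(dt > 0.0 && T >= T0);
    const int N = static_cast<int>(std::floor((T - T0) / dt));
    std::vector<std::pair<double, double>> steps;
    steps.reserve(N + 1);
    for (int k = 0; k < N; ++k)
        steps.emplace_back(T0 + k * dt, dt);   // time from the integer counter
    const double t_end = T0 + N * dt;
    if (t_end < T)
        steps.emplace_back(t_end, T - t_end);  // leftover interval, shorter than dt
    return steps;
}
```

Note that the leftover step can still be extremely short when (T - T0)/dt is almost an integer, so it may be worth merging it into the previous step if your scheme tolerates a slightly larger dt.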
The old answer:
You can easily get a very small time increment at the end, as you say,
because the result T/N is typically not exact
(and 0.03 certainly is not exact).
I would prefer to develop t and dt like this:
int N = 7;
double t = 0.0;
double T = 1.0;
double dt = (T - t)/N;
for (int step_number = 0; step_number < N; ++step_number, t += dt)
{
// ... calculations with t, T, dt, etc.
}
(Notice that this says dt = (T - t)/N just in case you ever decide to start the iterations at a non-zero value of t.)
Alternatively, and potentially slightly more accurate if N is very large
(because t += dt effectively has to round off dt as soon as t gets much larger):
int N = 7;
double T0 = 0.0;
double T = 1.0;
double dt = (T - T0)/N;
for (int step_number = 0; step_number < N; ++step_number)
{
double t = T0 + step_number * dt;
// ... calculations with t, T, dt, etc.
}

Related

Performance bottlenecks in fast evaluation of trig functions using Eigen and MEX

In a project using Matlab's C++ MEX API, I have to compute the value exp(j * 2pi * x) for over 100,000 values of x, where x is always a positive double. I've written some helper functions that break down the computation into sin/cos using Euler's formula. I then apply the method of range reduction to reduce my values to their corresponding points in the domain [0,T/4], where T is the period of the exponential I'm computing. I keep track of which quadrant in [0, T] the original value would have fallen into for later. I can then compute the trig function using a Taylor series polynomial in Horner form and apply the appropriate shift depending on which quadrant the original value was in. For further information on some of the concepts in this technique, check out this answer. Here is the code for this function:
Eigen::VectorXcd calcRot2(const Eigen::Ref<const Eigen::VectorXd>& idxt) {
Eigen::VectorXd vidxt = idxt.array() - idxt.array().floor();
Eigen::VectorXd quadrant = (vidxt.array()*2+0.5).floor();
vidxt.array() -= (quadrant.array()*0.5);
vidxt.array() *= 2*3.14159265358979;
const Eigen::VectorXd sq = vidxt.array()*vidxt.array();
Eigen::VectorXcd M(vidxt.size());
M.real() = fastCos2(sq);
M.imag() = fastSin2(vidxt,sq);
M = (quadrant.array() == 1).select(-M,M);
return M;
}
I profiled the code segment in which this function is called using std::chrono and averaged over 500 calls to the function. Each call to the mex function processes all 100,000+ values by calling calcRot2 in a loop, and each iteration passes about 200 values to calcRot2. I find the following average runtimes:
runtime with calcRot2: 75.4694 ms
runtime with fastSin/Cos commented out: 50.2409 ms
runtime with calcRot2 commented out: 30.2547 ms
Looking at the difference between the two extreme cases, it seems like calcRot has a large contribution to the runtime. However, only a portion of that comes from the sin/cos calculation. I would assume Eigen's implicit vectorization and the compiler would make the runtime of the other operations in the function effectively negligible. (floor shouldn't be a problem!) Where exactly is the performance bottleneck here?
This is the compilation command I'm performing (It uses MinGW64 which I think is the same as gcc):
mex(ipath,'CFLAGS="$CFLAGS -O3 -fno-math-errno -ffast-math -fopenmp -mavx2"','LDFLAGS="$LDFLAGS -fopenmp"','DAS.cpp','DAShelper.cpp')
Reference Code
For reference, here is the code segment in the main mex function where the timer is called, and the helper function that calls calcRot2():
MEX function call:
chk1 = std::chrono::steady_clock::now();
// Calculate beamformed signal at each point
Eigen::MatrixXcd bfVec(p.nPoints,1);
#pragma omp parallel for
for (int i = 0; i < p.nPoints; i++) {
calcPoint(idxt.col(i),SIG,p,bfVec(i));
}
chk2 = std::chrono::steady_clock::now();
auto diff3 = chk2 - chk1;
calcPoint:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
const Eigen::Ref<const Eigen::MatrixXcd>& SIG,
Parameters& p, std::complex<double>& bfVal) {
Eigen::VectorXcd pRot = calcRot2(idxt*p.fc/p.fs);
int j = 0;
for (auto x : idxt) {
if(x >= 0) {
int vIDX = static_cast<int>(x);
bfVal += (SIG(vIDX,j)*(vIDX + 1 - x) + SIG(vIDX+1,j)*(x - vIDX))*pRot(j);
}
j++;
}
}
Clarification
To clarify, the line
(vidxt.array()*2+0.5).floor()
is meant to yield:
0 if vidxt is in [0, 0.25)
1 if vidxt is in [0.25, 0.75)
2 if vidxt is in [0.75, 1)
The idea here is that when vidxt is in the second interval (quadrants 2 and 3 on the unit circle for functions with period 2pi), then the value needs to map to its negative value. Otherwise, the range reduction maps the values to the correct values.
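For reference, here is a scalar sketch of the same range reduction, without Eigen. The function name rot is made up, and std::cos/std::sin stand in for fastCos2/fastSin2, whose code is not shown. Note that with this reduction the reduced angle ends up in [-pi/2, pi/2), not only in [0, pi/2).

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Computes exp(j * 2*pi * x) via range reduction to one half period.
std::complex<double> rot(double x)
{
    const double two_pi = 6.283185307179586;
    const double frac = x - std::floor(x);                  // fractional part in [0, 1)
    const double quadrant = std::floor(frac * 2.0 + 0.5);   // 0, 1 or 2
    const double angle = (frac - quadrant * 0.5) * two_pi;  // in [-pi/2, pi/2)
    // Stand-in for the polynomial approximations of the question.
    const std::complex<double> m(std::cos(angle), std::sin(angle));
    // quadrant == 1 means the angle was shifted by pi, which flips the sign.
    return quadrant == 1.0 ? -m : m;
}
```

Comparing rot(x) against std::polar(1.0, 2*pi*x) for a few values is a cheap way to validate the quadrant logic before swapping in the fast approximations.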
The benefits of Eigen's vectorization are outweighed because you evaluate your expressions into temporary vectors. Allocating, deallocating, filling and reading these vectors has cost that seems significant. This is especially so because the expressions themselves are relatively simple (just a few scalar operations).
Expression objects
What usually helps here is aggregating into fewer expressions. For example line 3 and 4 can be collapsed into one:
vidxt.array() = 2*3.14159265358979 * (vidxt.array() - quadrant.array()*0.5);
(BTW: Note that math.h on most platforms provides a constant M_PI with pi in double precision).
Beyond that, Eigen expressions can be combined and reused. Something like this:
auto vidxt0 = idxt.array() - idxt.array().floor();
auto quadrant = (vidxt0*2+0.5).floor();
auto vidxt = 2*3.14159265358979 * (vidxt0 - quadrant.array()*0.5);
auto sq = vidxt.array().square();
Eigen::VectorXcd M(vidxt.size());
M.real() = fastCos2(sq);
M.imag() = fastSin2(vidxt,sq);
M = (quadrant.array() == 1).select(-M,M);
Note that none of the auto values are vectors. They are expression objects that behave like arrays and can be evaluated into vectors or arrays.
You can pass these on to your fastCos2 and fastSin2 function by declaring them as templates. The typical Eigen pattern would be something like
template<class Derived>
void fastCos2(const Eigen::ArrayBase<Derived>& sq);
The idea here is that ultimately, everything compiles into one huge loop that gets executed when you evaluate the expression into a vector or array. If you reference the same sub-expression multiple times, the compiler may be able to eliminate the redundant computations.
Unfortunately, I could not get any better performance out of this particular code, so it is no real help here, but it is still something worth exploring in these kinds of cases.
fastSin/Cos return value
Speaking of temporary vectors: You didn't include the code for your fastSin/Cos functions, but it looks a lot like you return a temporary vector which is then copied into the real and imaginary parts of the actual return value. This is another temporary that you may want to avoid. Something like this:
template<class Derived1, class Derived2>
void fastCos2(const Eigen::MatrixBase<Derived1>& M, const Eigen::MatrixBase<Derived2>& sq)
{
Eigen::MatrixBase<Derived1>& M_mut = const_cast<Eigen::MatrixBase<Derived1>&>(M);
M_mut = sq...;
}
fastCos2(M.real(), sq);
Please refer to Eigen's documentation on the topic of function arguments.
The downside of this approach in this particular case is that now the output is not consecutive (real and imaginary parts are interleaved). This may affect vectorization negatively. You may be able to work around this by combining the sin and cos functions into one expression for both. Benchmarking is required.
Using a plain loop
As others have pointed out, using a loop may be easier in this particular case. You noted that this was slower. I have a theory why: You did not specify -DNDEBUG in your compile options. If you don't, all array indices in Eigen vectors are range-checked with an assertion. These cost time and prevent vectorization. If you include this compile flag, I find my code significantly faster than using Eigen expressions.
Alternatively, you can use raw C pointers to the input and output vector. Something like this:
std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
const double* iidxt = idxt.data();
std::complex<double>* iM = M.data();
for(std::ptrdiff_t j = 0; j < n; ++j) {
double ival = iidxt[j];
double vidxt = ival - std::floor(ival);
double quadrant = std::floor(vidxt * 2. + 0.5);
vidxt = (vidxt - quadrant * 0.5) * (2. * 3.14159265358979);
double sq = vidxt * vidxt;
// stand-in for sincos
std::complex<double> jval(sq, vidxt + sq);
iM[j] = quadrant == 1. ? -jval : jval;
}
Fixed sized arrays
To avoid the cost of memory allocation and make it easier for the compiler to avoid memory operations in the first place, it can help to run the computation on blocks of fixed size. Something like this:
std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
std::ptrdiff_t i;
for(i = 0; i + 4 <= n; i += 4) {
Eigen::Array4d idxt_i = idxt.segment<4>(i);
...
M.segment<4>(i) = ...;
}
if(i + 2 <= n) {
Eigen::Array2d idxt_i = idxt.segment<2>(i);
...
M.segment<2>(i) = ...;
i += 2;
}
if(i < n) {
// last index scalar
}
This kind of stuff needs careful tuning to ensure that vectorized code is generated and there are no unnecessary temporary values on the stack. If you can read assembler, Godbolt is very helpful.
Other remarks
Eigen includes vectorized versions of sin and cos. Have you compared your code to these instead of e.g. Eigen's complex exp function?
Depending on your math library, there is also an explicit sincos function to compute sine and cosine in one function. It is not vectorized but still saves time on range reduction. You can (usually) access it through std::polar. Try this:
Eigen::VectorXd scale = ...;
Eigen::VectorXd phase = ...;
// M = scale * exp(-2 pi j phase)
Eigen::VectorXcd M = scale.binaryExpr(-2. * M_PI * phase,
[](double s, double p) noexcept -> std::complex<double> {
return std::polar(s, p);
});
If your goal is an approximation instead of a precise result, shouldn't your first step be to cast to single precision? Maybe after the range reduction to avoid losing too many decimal places. At the very least it will double the work done per clock cycle. Also, regular sine and cosine implementations take less time in float.
Edit
I had to correct myself on the cast to int64 instead of int. There is no vectorized conversion to int64_t until AVX512.
The line (vidxt.array()*2+0.5).floor() bugs me slightly. This is meant to round down to negative infinity for [0, 0.5) and up to positive infinity for [0.5, 1), correct? vidxt is never negative. Therefore this line should be equivalent to (vidxt.array()*2).round(). With AVX2 and -ffast-math that saves one instruction. With SSE2 none of these actually vectorize, as can be seen on Godbolt.

Multidimensional Integration - Coupled Limits

I need to calculate the value of a high dimensional integral in C++. I have found numerous libraries capable of solving this task for fixed limit integrals,
\int_{0}^{L} \int_{0}^{L} dx dy f(x,y) .
However the integrals which I am looking at have variable limits,
\int_{0}^{L} \int_{x}^{L} dx dy f(x,y) .
To clarify what I mean, here is a naive 2D Riemann sum implementation, which returns the desired result,
int steps = 100;
double integral = 0;
double dl = L/((double) steps);
double x[2] = {0};
for(int i = 0; i < steps; i ++){
x[0] = dl*i;
for(int j = i; j < steps; j ++){
x[1] = dl*j;
double val = f(x);
integral += val*val*dl*dl;
}
}
where f is some arbitrary function and L the common upper integration limit. While this implementation works, it's slow and thus impractical for higher dimensions.
Effective algorithms for higher dimensions exist, but to my knowledge, library implementations (e.g. Cuba) take a fixed value vector as the limit argument which renders them useless for my problem.
Is there any reason for this and/or is there any trick to circumvent the problem?
Your integration order is wrong, should be dy dx.
You are integrating over the triangle
0 <= x <= y <= L
inside the square [0,L]x[0,L]. This can be simulated by integrating over the full square where the integrand f is defined as 0 outside of the triangle. In many cases, when f is defined on the full square, this can be accomplished by taking the product of f with the indicator function of the triangle as new integrand.
When integrating over a triangular region such as 0<=x<=y<=L one can take advantage of symmetry: integrate f(min(x,y),max(x,y)) over the square 0<=x,y<=L and divide the result by 2. This has an advantage over extending f by zero (the method mentioned by LutzL) in that the extended function is continuous, which improves the performance of the integration routine.
I compared these on the example of the integral of 2x+y over 0<=x<=y<=1. The true value of the integral is 2/3. Let's compare the performance; for demonstration purposes I use a Matlab routine, but this is not specific to the language or library used.
Extending by zero
f = @(x,y) (2*x+y).*(x<=y);
result = integral2(f, 0, 1, 0, 1);
fprintf('%.9f\n',result);
Output:
Warning: Reached the maximum number of function evaluations
(10000). The result fails the global error test.
0.666727294
Extending by symmetry
g = @(x,y) (2*min(x,y)+max(x,y));
result2 = integral2(g, 0, 1, 0, 1)/2;
fprintf('%.9f\n',result2);
Output:
0.666666776
The second result is roughly 500 times more accurate than the first (errors of about 6e-5 versus 1e-7).
Unfortunately, this symmetry trick is not available for general domains; but integration over a triangle comes up often enough so it's useful to keep it in mind.
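Since the question is about C++, here is a small sketch of the same comparison there. It swaps Matlab's adaptive integral2 for a plain midpoint rule on a uniform grid, so the numbers differ from the Matlab output above; the function names are made up for illustration. The point it shows is the same: the discontinuous zero-extended integrand converges much more slowly than the continuous symmetric one.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Integrand from the example above.
double f(double x, double y) { return 2.0 * x + y; }

// Midpoint rule over the unit square with n cells per axis.
template <class G>
double midpoint_square(G g, int n)
{
    const double h = 1.0 / n;
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            sum += g((i + 0.5) * h, (j + 0.5) * h);
    return sum * h * h;
}

// Triangle 0 <= x <= y <= 1, integrand extended by zero outside of it.
double triangle_by_zero(int n)
{
    return midpoint_square(
        [](double x, double y) { return x <= y ? f(x, y) : 0.0; }, n);
}

// Same triangle, integrand extended by symmetry; the square counts it twice.
double triangle_by_symmetry(int n)
{
    return 0.5 * midpoint_square(
        [](double x, double y) { return f(std::min(x, y), std::max(x, y)); }, n);
}
```

For this particular integrand, only the cells on the diagonal contribute to the error. With zero extension that error shrinks like 1/n, with the symmetric extension like 1/n^2.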
I was a bit confused by your integral definition, so I went by your code.
I did some testing; here is your code:
//---------------------------------------------------------------------------
double f(double *x) { return (x[0]+x[1]); }
void integral0()
{
double L=10.0;
int steps = 10000;
double integral = 0;
double dl = L/((double) steps);
double x[2] = {0};
for(int i = 0; i < steps; i ++){
x[0] = dl*i;
for(int j = i; j < steps; j ++){
x[1] = dl*j;
double val = f(x);
integral += val*val*dl*dl;
}
}
}
//---------------------------------------------------------------------------
Here is optimized code:
//---------------------------------------------------------------------------
void integral1()
{
double L=10.0;
int i0,i1,steps = 10000;
double x[2]={0.0,0.0};
double integral,val,dl=L/((double)steps);
#define f(x) (x[0]+x[1])
integral=0.0;
for(x[0]= 0.0,i0= 0;i0<steps;i0++,x[0]+=dl)
for(x[1]=x[0],i1=i0;i1<steps;i1++,x[1]+=dl)
{
val=f(x);
integral+=val*val;
}
integral*=dl*dl;
#undef f
}
//---------------------------------------------------------------------------
results:
[ 452.639 ms] integral0
[ 336.268 ms] integral1
So the speedup is about 1.3x (32-bit app on WOW64, AMD 3.2 GHz); for higher dimensions the gain multiplies.
Still, I think this approach is slow. The only way I can think of to reduce the complexity is to simplify things algebraically, either with integration tables or with Laplace or Z transforms, but for that f(*x) must be known.
A constant-factor speedup can of course be obtained with multi-threading and/or GPU usage. This can give you up to an N-fold speed increase, because the computation is directly parallelisable.

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes the variance of them in each dimension (following Knuth's algorithm).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would be really happy to trade accuracy for speed and compute the variances only approximately. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Note that before the edit I had used -o3, with a lowercase 'o'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure that there is a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
compute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: points[i] is just the i-th element of the std::vector, and the second [] is implemented as follows in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in compute_variances() is for sure something heavy! However, Knuth's algorithm is a numerically stable one, and I was not able to find another algorithm that would be both numerically stable and free of division.
Note that I am not interested in parallelism right now.
[EDIT.2]
Minimal example of Point class (I think I didn't forget to show something):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* @param i - index of the coordinate
* @return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* @param i - index of the coordinate
* @return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};
1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped up by doing the computation for four elements at once with SSE. Since each loop iteration only processes a single element d, independently of the others, nothing prevents this. If you go down to 16-bit short instead of 32-bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the whole solution to your performance problem, just one measure that can be combined with others.
2. Micro-parallelism
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The option to the latter depends on your CPU, but nowadays typically -march=core2 is fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
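A minimal sketch of what that could look like, assuming you are free to store all coordinates in one flat row-major array instead of a vector of Points. The name welford_flat and the layout data[i * dim + d] are my assumptions, not the asker's code. The update is the same as in the question, with the division hoisted out of the inner loop and no per-access bounds check.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical flat layout: coordinate d of point i lives at data[i * dim + d].
// On return avg[d] holds the mean and m2[d] the sum of squared deviations
// (divide by n or n - 1 to turn it into a variance).
void welford_flat(const std::vector<float>& data, std::size_t dim,
                  std::vector<float>& avg, std::vector<float>& m2)
{
    assert(dim > 0 && data.size() % dim == 0);
    const std::size_t n = data.size() / dim;
    avg.assign(dim, 0.0f);
    m2.assign(dim, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        const float inv_n = 1.0f / static_cast<float>(i + 1); // one division per point
        const float* row = &data[i * dim];                     // contiguous coordinates
        for (std::size_t d = 0; d < dim; ++d) {
            const float delta = row[d] - avg[d];
            avg[d] += delta * inv_n;
            m2[d] += delta * (row[d] - avg[d]);
        }
    }
}
```

The inner loop now walks three contiguous float arrays, which is also the shape a compiler needs to have a chance at auto-vectorizing it.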
Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen the impact of cache misses, run the inner loop alternately forwards then backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE754 representation of 0.0 in 32 bits is 0x00000000. If the dimension is big, it is worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense, to try to reduce the number of divisions: use partial non-cumulative average calculation, that would result in Dim division operation for N elements (instead of N x Dim); N < points.size()
Huge speedup could be achieved, using Cuda or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).
Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
// Assumes points[0].dim() is a multiple of 4; otherwise a scalar tail loop is needed.
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
// (register dropped: it is deprecated since C++11 and removed in C++17)
const float p1 = points[i][d + 0];
const float p2 = points[i][d + 1];
const float p3 = points[i][d + 2];
const float p4 = points[i][d + 3];
const float delta1 = p1 - avg[d+0];
const float delta2 = p2 - avg[d+1];
const float delta3 = p3 - avg[d+2];
const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place the avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors like datums that are very close to each other. Less chance for data cache miss and better chance to load all of the data into the cache.
You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to the extraction of the parts, performing the math, then reassembling the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is converted back to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division, and shifting is usually faster than either.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
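To make the power-of-2 base concrete, here is a minimal sketch using 65536 (2^16) as the base. The names fixed_t, kFracBits, to_fixed, to_double, fx_add and fx_mul are made up for this illustration, and the 16.16 split is an assumption; pick the split that matches your value range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical Q16.16 format: 16 integer bits, 16 fractional bits.
using fixed_t = std::int32_t;
constexpr int kFracBits = 16;                      // base is 2^16 = 65536
constexpr fixed_t kOne = fixed_t{1} << kFracBits;  // fixed-point representation of 1.0

// Convert at the input boundary, rounding to the nearest representable step.
inline fixed_t to_fixed(double v) {
    assert(v > -32768.0 && v < 32767.0);           // must fit in 16 integer bits
    return static_cast<fixed_t>(v * kOne + (v >= 0 ? 0.5 : -0.5));
}

// Convert back at the output boundary.
inline double to_double(fixed_t f) {
    return static_cast<double>(f) / kOne;
}

// Addition is plain integer addition.
inline fixed_t fx_add(fixed_t a, fixed_t b) { return a + b; }

// Multiplication carries an extra factor of 2^16, removed with a shift
// instead of a division. A 64-bit intermediate avoids overflow.
// (Right-shifting a negative value is arithmetic on mainstream compilers,
// and guaranteed to be so from C++20 on.)
inline fixed_t fx_mul(fixed_t a, fixed_t b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<fixed_t>(wide >> kFracBits);
}
```

With this sketch, fx_mul(to_fixed(1.5), to_fixed(2.0)) gives the same bits as to_fixed(3.0). Whether it is actually faster than float on a given processor is something to measure, not assume.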
Summary
Changing the design or implementation to use Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
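For reference, the "average first, then squared differences" version of Point 1 could look like the sketch below. The names MeanVar and two_pass_mean_variance are made up here, and it computes the population variance (divide by n); divide by n-1 instead if you want the sample variance.

```cpp
#include <cassert>
#include <vector>

struct MeanVar {
    double mean;
    double variance;
};

// Two simple consecutive loops: the first finds the mean,
// the second accumulates squared deviations from that mean.
inline MeanVar two_pass_mean_variance(const std::vector<double>& x) {
    assert(!x.empty());
    const double n = static_cast<double>(x.size());

    double sum = 0.0;
    for (double xi : x) sum += xi;
    const double mean = sum / n;

    double sq = 0.0;
    for (double xi : x) {
        const double d = xi - mean;
        sq += d * d;
    }
    return {mean, sq / n};  // population variance
}
```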
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n;
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

How to optimize this CUDA kernel

I've profiled my model and it seems that this kernel accounts for about 2/3 of my total runtime. I was looking for suggestions to optimize it. The code is as follows.
__global__ void calcFlux(double* concs, double* fluxes, double* dt)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
fluxes[idx]=knowles_flux(idx, concs);
//fluxes[idx]=flux(idx, concs);
}
__device__ double knowles_flux(int r, double *conc)
{
double frag_term = 0;
double flux = 0;
if (r == ((maxlength)-1))
{
//Calculation type : "Max"
flux = -km*(r)*conc[r]+2*(ka)*conc[r-1]*conc[0];
}
else if (r > ((nc)-1))
{
//Calculation type : "F"
//arrSum3(conc, &frag_term, r+1, maxlength-1);
for (int s = r+1; s < (maxlength); s++)
{
frag_term += conc[s];
}
flux = -(km)*(r)*conc[r] + 2*(km)*frag_term - 2*(ka)*conc[r]*conc[0] + 2*(ka)*conc[r-1]*conc[0];
}
else if (r == ((nc)-1))
{
//Calculation type : "N"
//arrSum3(conc, &frag_term, r+1, maxlength-1);
for (int s = r+1; s < (maxlength); s++)
{
frag_term += conc[s];
}
flux = (kn)*pow(conc[0],(nc)) + 2*(km)*frag_term - 2*(ka)*conc[r]*conc[0];
}
else if (r < ((nc)-1))
{
//Calculation type : "O"
flux = 0;
}
return flux;
}
Just to give you an idea of why the for loop is an issue, this kernel is launched on an array of about maxlength = 9000 elements. For our purposes now, nc is in the range of 2-6. Here's an illustration of how this kernel processes the incoming array (conc). For this array, five different types of calculations need to be applied to different groups of elements.
Array element : 0 1 2 3 4 5 6 7 8 9 ... 8955 8956 8957 8958 8959 8960
Type of calc : M O O O O O N F F F ... F F F F F Max
The potential problems I've been trying to deal with right now are branch divergence from the quadruple if-else and the for loop.
My idea for dealing with the branch divergence is to break this kernel down into four separate device functions or kernels that treat each region separately and all launch at the same time. I'm not sure this is significantly better than just letting the branch divergence take place, which if I'm not mistaken, would cause the four calculation types to be run in serial.
To deal with the for loop, you'll notice that there's a commented out arrSum3 function, which I wrote based off my previously (and probably poorly) written parallel reduction kernel. Using it in place of the for loop drastically increased my runtime. I feel like there's a clever way to accomplish what I'm trying to do with the for loop, but I'm just not that smart and my advisor is tired of me "wasting time" thinking about it.
Appreciate any help.
EDIT
Full code is located here : https://stackoverflow.com/q/21170233/1218689
Assuming sgn() and abs() are not derived from "if"s and "else"s
__device__ double knowles_flux(int r, double *conc)
{
double frag_term = 0;
double flux = 0, fluxA = 0, fluxB = 0, fluxC = 0; // partial results, one per calculation type
//Calculation type : "Max"
//no divergence
//should prefer 20-30 extra cycles instead of a branching.
//may not be good for CPU
fluxA = (1-abs(sgn(r-(maxlength-1)))) * (-km*(r)*conc[r]+2*(ka)*conc[r-1]*conc[0]);
//is zero if r and maxlength-1 are not equal
//always compute this in shared memory so work will be equal for all cores, no divergence
// you should divide kernel into several pieces to do a reduction
// but if you dont want that, then you can try :
for (int s = 0;s<someLimit ; s++) // all count for same number of cycles so no divergence
{
frag_term += conc[s] * ( abs(sgn( s-maxlength ))*sgn(1- sgn( s-maxlength )) )* ( sgn(1+sgn(s-(r+1))) );
}
//but you can make easier of this using "add and assign" operation
// in local memory (was it __shared in CUDA?)
// global conc[] to local concL[] memory(using all cores)(100 cycles)
// for(others from zero to upper_limit)
// if(localID==0)
// {
// frag_termL[0]+=concL[s] // local to local (10 cycles/assign.)
// frag_termL[0+others]=frag_termL[0]; // local to local (10 cycles/assign.)
// } -----> uses nearly same number of cycles but uses much less energy
//using single core (2000 instr. with single core vs 1000 instr. with 2k cores)
// in local memory, then copy it to private registers accordingly using all cores
//Calculation type : "F"
fluxB = ( abs(sgn(r-(nc-1)))*sgn(1+sgn(r-(nc-1))) )*(-(km)*(r)*conc[r] + 2*(km)*frag_term - 2*(ka)*conc[r]*conc[0] + 2*(ka)*conc[r-1]*conc[0]);
// is zero if r is not greater than (nc-1)
//Calculation type : "N"
fluxC = ( 1-abs(sgn(r-(nc-1))) )*((kn)*pow(conc[0],(nc)) + 2*(km)*frag_term - 2*(ka)*conc[r]*conc[0]);
//zero if r and nc-1 are not equal
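flux=fluxA+fluxB+fluxC; //only one of these should be different than zero
//Calculation type : "O" (r < nc-1) needs no extra factor: all three masks above are already zero there.
//(Multiplying by -sgn(r-(nc-1))*sgn(1-sgn(r-(nc-1))) here would zero the result for every r >= nc-1.)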
return flux;
}
Okay, let me elaborate a bit:
if(a>b) x+=y;
can be taken as
if a-b is negative, sgn(a-b) is -1
then adding 1 to that -1 gives zero ==> handles the a<b case
x += y*(sgn(a-b)+1) adds 0 if a<b (not a>b), x unchanged
if a-b is zero, sgn(a-b) is zero
then we should multiply the expression above by sgn(a-b) too!
x += y*(sgn(a-b)+1)*sgn(a-b)
means
x += y*(0+1)*0 = 0, so a==b leaves x unchanged too!
let's check what happens if a>b
x += y*(sgn(a-b)+1)*sgn(a-b)
x += y*(1+1)*1 ==> y*2 is not acceptable, needs another sgn on the outside
x += y*sgn((sgn(a-b)+1)*sgn(a-b))
x += y*sgn((1+1)*1)
x += y*sgn(2)
x += y only when a is greater than b
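The derivation above can be checked on the host with a small sketch. The sgn helper here is an assumption (CUDA does not provide an integer sgn under that name), written with comparisons that compilers usually turn into flag-setting instructions rather than jumps; greater_mask and add_if_greater are names made up for the illustration.

```cpp
#include <cassert>

// Assumed helper: returns -1, 0 or +1 for negative, zero or positive v.
inline int sgn(int v) { return (v > 0) - (v < 0); }

// 1 when a > b, 0 otherwise, following sgn((sgn(a-b)+1)*sgn(a-b)).
// Note: a - b must not overflow int.
inline int greater_mask(int a, int b) {
    const int s = sgn(a - b);
    const int m = sgn((s + 1) * s);
    assert(m == 0 || m == 1);
    return m;
}

// Branch-free equivalent of: if (a > b) x += y;
// Caveat: if y is NaN or infinite, y * 0 is NaN, not 0.
inline double add_if_greater(double x, double y, int a, int b) {
    return x + y * greater_mask(a, b);
}
```

Whether this beats a plain branch depends on the hardware and on how divergent the warp really is, so it is worth profiling both versions.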
When there are too many occurrences of
abs(sgn(r-(nc-1)))
you can compute it once and re-use it as
tmp=abs(sgn(r-(nc-1)));
..... *tmp*(tmp-1) ....
...... +tmp*zxc[s] .....
...... ......
to decrease total cycles even more! Register access can be on the order of terabytes/s, so it shouldn't be a problem. The same can be done for global accesses:
tmpGlobal= conc[r];
...... tmpGlobal * tmp .....
.... tmpGlobal +x -y ....
with everything done in private registers at terabytes per second.
Warning: reading from conc[-1] shouldn't cause any faults as long as the value is multiplied by zero and conc[0] does not already sit at the very start of the address space, but it is still an out-of-bounds read. Writing there is hazardous.
If you need to avoid conc[-1] anyway, you can make the index absolute too! See:
tmp=conc[i-1] becomes tmp=conc[abs((i-1))], which always reads from a non-negative index; the value will be multiplied by zero later anyway. This is lower-bound protection.
You can apply upper-bound protection too. It just adds even more cycles.
Think about using vector-shuffle operations if working on pure scalar values is not fast enough when accessing conc[r-1] and conc[r+1]. A shuffle between a vector's elements is faster than copying through local memory to another core/thread.

Efficient way to compute geometric mean of many numbers

I need to compute the geometric mean of a large set of numbers, whose values are not a priori limited. The naive way would be
double geometric_mean(std::vector<double> const&data) // failure
{
auto product = 1.0;
for(auto x:data) product *= x;
return std::pow(product,1.0/data.size());
}
However, this may well fail because of underflow or overflow in the accumulated product (note: long double doesn't really avoid this problem). So, the next option is to sum-up the logarithms:
double geometric_mean(std::vector<double> const&data)
{
auto sum_log = 0.0;
for(auto x:data) sum_log += std::log(x);
return std::exp(sum_log/data.size());
}
This works, but calls std::log() for every element, which is potentially slow. Can I avoid that? For example by keeping track of (the equivalent of) the exponent and the mantissa of the accumulated product separately?
The "split exponent and mantissa" solution:
double geometric_mean(std::vector<double> const & data)
{
double m = 1.0;
long long ex = 0;
double invN = 1.0 / data.size();
for (double x : data)
{
int i;
double f1 = std::frexp(x,&i);
m*=f1;
ex+=i;
}
return std::pow( std::numeric_limits<double>::radix,ex * invN) * std::pow(m,invN);
}
If you are concerned that ex might overflow you can define it as a double instead of a long long, and multiply by invN at every step, but you might lose a lot of precision with this approach.
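For anyone unfamiliar with std::frexp, here is a minimal sketch of what the split does. It returns a mantissa in [0.5, 1) and an integer exponent such that x == mantissa * 2^exponent for positive finite x. The Split struct and split function are names made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

struct Split {
    double mantissa;  // in [0.5, 1) for positive finite input
    int exponent;     // x == mantissa * 2^exponent
};

// Thin wrapper around std::frexp for positive finite values.
inline Split split(double x) {
    assert(x > 0.0 && std::isfinite(x));
    int e = 0;
    const double m = std::frexp(x, &e);
    return {m, e};
}
```

For example, split(8.0) yields mantissa 0.5 and exponent 4, so a product of many values turns into a product of mantissas (each below 1, hence it can only underflow, never overflow) plus an integer sum of exponents.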
EDIT For large inputs, we can split the computation in several buckets:
double geometric_mean(std::vector<double> const & data)
{
long long ex = 0;
auto do_bucket = [&data,&ex](int first,int last) -> double
{
double ans = 1.0;
for ( ;first != last;++first)
{
int i;
ans *= std::frexp(data[first],&i);
ex+=i;
}
return ans;
};
const int bucket_size = -std::log2( std::numeric_limits<double>::min() );
std::size_t buckets = data.size() / bucket_size;
double invN = 1.0 / data.size();
double m = 1.0;
for (std::size_t i = 0;i < buckets;++i)
m *= std::pow( do_bucket(i * bucket_size,(i+1) * bucket_size),invN );
m*= std::pow( do_bucket( buckets * bucket_size, data.size() ),invN );
return std::pow( std::numeric_limits<double>::radix,ex * invN ) * m;
}
I think I figured out a way to do it. It combines the two routines in the question, similar to Peter's idea. Here is some example code.
double geometric_mean(std::vector<double> const&data)
{
const double too_large = 1.e64;
const double too_small = 1.e-64;
double sum_log = 0.0;
double product = 1.0;
for(auto x:data) {
product *= x;
if(product > too_large || product < too_small) {
sum_log+= std::log(product);
product = 1;
}
}
return std::exp((sum_log + std::log(product))/data.size());
}
The bad news is: this comes with a branch. The good news: the branch predictor is likely to get this almost always right (the branch should only rarely be triggered).
The branch could be avoided using Peter's idea of a constant number of terms in the product. The problem with that is that overflow/underflow may still occur within only a few terms, depending on the values.
You may be able to accelerate this by multiplying numbers as in your original solution and only converting to logarithms every certain number of multiplications (depending on the size of your initial numbers).
A different approach which would give better accuracy and performance than the logarithm method would be to compensate out-of-range exponents by a fixed amount, maintaining an exact logarithm of the cancelled excess. Like so:
const int EXP = 64; // maximal/minimal exponent
const double BIG = pow(2, EXP); // overflow threshold
const double SMALL = pow(2, -EXP); // underflow threshold
double product = 1;
int excess = 0; // number of times BIG has been divided out of product
for(int i=0; i<n; i++)
{
product *= A[i];
while(product > BIG)
{
product *= SMALL;
excess++;
}
while(product < SMALL)
{
product *= BIG;
excess--;
}
}
double mean = pow(product, 1.0/n) * pow(BIG, double(excess)/n);
All multiplications by BIG and SMALL are exact, and there are no calls to log (a transcendental, and therefore particularly imprecise, function).
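The exactness claim can be checked with a short sketch: multiplying by a power of two only changes the exponent, so the mantissa reported by std::frexp stays bit-identical as long as the result neither overflows nor drops into the subnormal range. The function name scaling_is_exact is made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// True when multiplying x by 2^k leaves the frexp mantissa unchanged and
// shifts the exponent by exactly k. Assumes x * 2^k stays in the normal range.
inline bool scaling_is_exact(double x, int k) {
    assert(x != 0.0 && std::isfinite(x));
    int e0 = 0;
    int e1 = 0;
    const double m0 = std::frexp(x, &e0);
    const double m1 = std::frexp(x * std::ldexp(1.0, k), &e1);
    return m0 == m1 && e1 == e0 + k;
}
```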
There is a simple idea to reduce computation and also to prevent overflow. You can group the numbers together, say at least two at a time, take the log of each group's product, and then sum those logs. With K being the geometric mean of five numbers a..e:
log(abcde) = 5*log(K)
log(ab) + log(cde) = 5*log(K)
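A minimal sketch of that grouping, assuming all values are positive and that any `group` consecutive values can be multiplied without overflow or underflow (that assumption is on the caller). The function name and the default group size of 2 are made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean with one std::log call per group instead of one per element.
inline double grouped_log_geometric_mean(const std::vector<double>& data,
                                         std::size_t group = 2) {
    assert(!data.empty() && group > 0);
    double sum_log = 0.0;
    for (std::size_t i = 0; i < data.size(); i += group) {
        const std::size_t end = std::min(i + group, data.size());
        double partial = 1.0;                 // product of this group only
        for (std::size_t j = i; j < end; ++j) partial *= data[j];
        sum_log += std::log(partial);         // log(ab) + log(cd) + ...
    }
    return std::exp(sum_log / static_cast<double>(data.size()));
}
```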
Summing logs to compute products stably is perfectly fine, and rather efficient (if this is not enough: there are ways to get vectorized logarithms with a few SSE operations -- there are also Intel MKL's vector operations).
To avoid overflow, a common technique is to divide every number by the maximum or minimum magnitude entry beforehand (or sum log differences to the log max or log min). You can also use buckets if the numbers vary a lot (e.g. sum the logs of small numbers and large numbers separately). Note that typically neither of these is needed except for very large sets, since the log of a double is never huge (between, say, -700 and 700).
Also, you need to keep track of the signs separately.
Computing log x typically keeps the same number of significant digits as x, except when x is close to 1: you want to use std::log1p if you need to compute prod(1 + x_n) with small x_n.
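A small sketch of the log1p point: for a tiny x_n, forming 1 + x_n first rounds to exactly 1 and the log becomes 0, while std::log1p keeps the contribution. The function name log_prod_one_plus is made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// log of prod(1 + x_n), accumulated with log1p so that tiny x_n are not
// rounded away when forming 1 + x_n.
inline double log_prod_one_plus(const std::vector<double>& xs) {
    double s = 0.0;
    for (double x : xs) {
        assert(x > -1.0);        // 1 + x must be positive
        s += std::log1p(x);
    }
    return s;
}
```

With x_n = 1e-20, std::log(1.0 + x_n) is exactly 0, whereas the sketch returns a value close to 1e-20.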
Finally, if you have roundoff error problems when summing, you can use Kahan summation or variants.
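In case it helps, here is a minimal sketch of Kahan (compensated) summation applied to the sum of logs. The function names are made up here, the inputs are assumed positive, and the code must not be compiled with options such as -ffast-math, which allow the compiler to optimise the compensation away.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Kahan summation: c carries the low-order bits lost in each addition.
inline double kahan_sum(const std::vector<double>& v) {
    double sum = 0.0;
    double c = 0.0;
    for (double x : v) {
        const double y = x - c;    // apply the correction from the last step
        const double t = sum + y;  // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover what was lost (negated)
        sum = t;
    }
    return sum;
}

// Geometric mean via a compensated sum of logarithms.
inline double kahan_log_geometric_mean(const std::vector<double>& data) {
    assert(!data.empty());
    std::vector<double> logs;
    logs.reserve(data.size());
    for (double x : data) logs.push_back(std::log(x));
    return std::exp(kahan_sum(logs) / static_cast<double>(data.size()));
}
```

As a quick check, summing 1.0 followed by ten copies of 1e-16 with a plain loop gives exactly 1.0, while kahan_sum returns a value above 1.0.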
Instead of using logarithms, which are very expensive, you can directly scale the results by powers of two.
double geometric_mean(std::vector<double> const&data) {
double huge = scalbn(1,512);
double tiny = scalbn(1,-512);
int scale = 0;
double product = 1.0;
for(auto x:data) {
if (x >= huge) {
x = scalbn(x, -512);
scale++;
} else if (x <= tiny) {
x = scalbn(x, 512);
scale--;
}
product *= x;
if (product >= huge) {
product = scalbn(product, -512);
scale++;
} else if (product <= tiny) {
product = scalbn(product, 512);
scale--;
}
}
return exp2((512.0*scale + log2(product)) / data.size());
}