I would like to know under which conditions invariant parts of nested loops can be optimized.
To test this, I wrote two functions: one implements the factorization of three nested loops by hand, the other does not.
The non-factorized function looks like:
template<int k>
double __attribute__ ((noinline)) evaluate(const double u[], const double phi[])
{
double f = 0.;
for (int i3 = 0;i3<k;++i3)
for (int i2 = 0;i2<k;++i2)
for (int i1 = 0;i1<k;++i1)
f += u[i1+k*(i2+k*i3)] * phi[i1] * phi[i2] * phi[i3];
return f;
}
The factorized function is:
template<int k>
double __attribute__ ((noinline)) evaluate_fact(const double u[], const double phi[])
{
double f3 = 0.;
for (int i3 = 0;i3<k;++i3)
{
double f2 = 0.;
for (int i2 = 0;i2<k;++i2)
{
double f1 = 0.;
for (int i1 = 0;i1<k;++i1)
{
f1 += u[i1+k*(i2+k*i3)] * phi[i1];
}
f2 += f1 * phi[i2];
}
f3 += f2 * phi[i3];
}
return f3;
}
I call them from the following main:
int main()
{
const static unsigned int k=20;
double u[k*k*k];
double phi[k];
phi[0] = 1.;
for (unsigned int i=1;i<k;++i)
phi[i] = phi[i-1]*.333;
double e = 0.;
for (unsigned int i=0;i<1000;++i)
{
e += evaluate<k>(u, phi);
//e += evaluate_fact<k>(u, phi);
}
std::cout << "Evaluate " << e << std::endl;
}
For small k both functions generate the same assembly code, but beyond a certain size (k ≈ 10) the assembly is no longer the same, and callgrind shows more operations being performed in the non-factorized version.
How should I write my code (if this is possible at all), or what should I tell GCC, so that evaluate() is optimized to evaluate_fact()?
I am using GCC 7.1.0 with the flags -Ofast -fmove-loop-invariants.
Using -funroll-loops does not help unless I add --param max-completely-peeled-insns=10000 --param max-completely-peel-times=10000, but that is a completely different thing: it basically unrolls everything, and the resulting assembly is extensive.
Using -fassociative-math doesn't help either.
This paper claims that: "Traditional loop-invariant code motion, which is commonly applied by general-purpose compilers, only checks invariance with respect to the innermost loop." Does that apply to my code?
Thanks!
Task 2:
For a function f : R^n → R, the gradient at a point x ∈ R^n is to be calculated:
- Implement a function
CMyVector gradient(CMyVector x, double (*function)(CMyVector x)),
which receives the point x as its first parameter and the function f as a function pointer as its second parameter, and which calculates the gradient g = grad f(x) numerically
by
$g_i = \frac{f(x_1, \dots, x_{i-1}, x_i + h, x_{i+1}, \dots, x_n) - f(x_1, \dots, x_n)}{h}$
for fixed h = 10^{-8}.
My currently written program:
Header
#pragma once
#include <vector>
#include <math.h>
class CMyVektor
{
private:
/* data */
int Dimension = 0;
std::vector<double>Vector;
public:
CMyVektor();
~CMyVektor();
//Public Method
void set_Dimension(int Dimension /* current dimension */);
void set_specified_Value(int index, int Value);
double get_specified_Value(int key);
int get_Vector_Dimension();
int get_length_Vektor();
double& operator [](int index);
std::string umwandlung(); // "Umwandlung" = conversion; needs #include <string>
};
CMyVektor::CMyVektor(/* args */)
{
Vector.resize(0, 0);
}
CMyVektor::~CMyVektor()
{
// Nothing to do: std::vector<double> releases its own storage,
// and the elements are plain doubles, not pointers, so delete must not be called on them.
}
void CMyVektor::set_Dimension(int Dimension /* current dimension */)
{
Vector.resize(Dimension);
};
void CMyVektor::set_specified_Value(int index, int Value)
{
if (Vector.empty())
{
Vector.push_back(Value);
}
else {
Vector[index] = Value;
}
};
double CMyVektor::get_specified_Value(int key)
{
// Return the element stored at index key (bounds-checked).
return Vector.at(key);
}
int CMyVektor::get_Vector_Dimension()
{
return Vector.size();
};
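// Computes the magnitude ("length") of the vector.
// Note: the int return type truncates the result.
int CMyVektor::get_length_Vektor()
{
double sum = 0.0;
for (size_t i = 0; i < Vector.size(); i++)
{
sum += Vector[i] * Vector[i]; // ^ is bitwise XOR in C++, not a power operator
}
return static_cast<int>(sqrt(sum));
}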
// Overloaded [] operator
double& CMyVektor::operator [](int index)
{
return Vector[index];
}
main.cpp
#include <iostream>
#include "ClassVektor.h"
using namespace std;
CMyVektor operator+(CMyVektor a, CMyVektor b);
CMyVektor operator*(double lambda, CMyVektor a);
CMyVektor gradient(CMyVektor x, double (*funktion)(CMyVektor x));
int main() {
CMyVektor V1;
CMyVektor V2;
CMyVektor C;
C.set_Dimension(V1.get_length_Vector());
C= V1 + V2;
std::cout << "Addition : "<< "(";;
for (int i = 0; i < C.get_length_Vector(); i++)
{
std::cout << C[i] << " ";
}
std::cout << ")" << endl;
C = lamda * C;
std::cout << "Skalarprodukt: "<< C[0]<< " ";
}
// Vector addition
CMyVektor operator+(CMyVektor a, CMyVektor b)
{
CMyVektor c;
// Add only if the dimensions match; otherwise return an empty vector.
if (a.get_Vector_Dimension() == b.get_Vector_Dimension())
{
c.set_Dimension(a.get_Vector_Dimension());
for (int i = 0; i < a.get_Vector_Dimension(); i++)
{
c[i] = a[i] + b[i];
}
}
return c;
}
// Computes the scalar product
CMyVektor operator*(double lambda, CMyVektor a)
{
CMyVektor c;
c.set_Dimension(1);
for (unsigned i = 0; i < a.get_length_Vector(); i++)
{
c[0] += lambda * a[i];
}
return c;
}
/*
* Difference quotient: (F(x0 + h) - F(x0)) / h
* First parameter: the point x. Second parameter: the function.
* Determines the gradient numerically.
*/
CMyVektor gradient(CMyVektor x, double (*funktion)(CMyVektor x))
{
}
My problem now is that I don't quite know how to deal with the
CMyVector gradient(CMyVector x, double (*function)(CMyVector x))
function and how to define a function that corresponds to it.
I hope that it is enough information. Many thanks.
The function parameter is the f in the difference formula. It takes a CMyVector parameter x and returns a double value. You need to supply a function parameter name. I'll assume func for now.
I don't see a parameter for h. Are you going to pass a single small value into the gradient function or assume a constant?
The parameter x is a vector. Will you add a constant h to each element?
This function specification is a mess.
Function returns a double. How do you plan to turn that into a vector?
No wonder you're confused. I am.
Are you trying to do something like this?
You are given a function signature
CMyVector gradient(CMyVector x, double (*function)(CMyVector x))
Without knowing the exact definition, I will assume that at least the basic numerical vector operations are defined. That means that the following statements compile:
CMyVector x {2.,5.,7.};
CMyVector y {1.,7.,4.};
CMyVector z {0.,0.,0.};
double a = 0.;
// vector addition and assignment
z = x + y;
// vector scalar multiplication and division
z = z * a;
z = x / 0.1;
We also need to know the dimension of the CMyVector class. I assumed, and will continue to assume, that it is three-dimensional.
The next step is to understand the function signature. You get two parameters. The first one denotes the point at which you are supposed to calculate the gradient. The second is a pointer to the function f in your formula. You do not know what f is, but you can call it on a vector from within your gradient function definition. That means that inside the definition you can do something like
double f_at_x = function(x);
and f_at_x will hold the value f(x) after that operation.
Armed with this, we can try to implement the formula that you mentioned in the question:
CMyVector gradient(CMyVector x, double (*function)(CMyVector x)) {
double h = 0.001;
// calculate first element of the gradient
CMyVector e1 {1.0, 0.0, 0.0};
double result1 = ( function(x + e1*h) - function(x) )/h;
// calculate second element of the gradient
CMyVector e2 {0.0, 1.0, 0.0};
double result2 = ( function(x + e2*h) - function(x) )/h;
// calculate third element of the gradient
CMyVector e3 {0.0, 0.0, 1.0};
double result3 = ( function(x + e3*h) - function(x) )/h;
// return the result
return CMyVector {result1, result2, result3};
}
There are several things worth mentioning in this code. First and most important, I have chosen h = 0.001. This may look like a very arbitrary choice, but the choice of the step size strongly affects the precision of your result. You can find a whole lot of discussion about that topic here. I took the same value that, according to that Wikipedia page, a lot of handheld calculators use internally. That might not be the best choice for the floating-point precision of your processor, but it should be a fair one to start with.
Secondly, the code looks very ugly to an advanced programmer. We are doing almost the same thing for each of the three dimensions. Usually you would do that in a for loop. Exactly how this is done depends on how the CMyVector type is defined.
Since CMyVektor is just reimplementing the valarray container, I will use valarray directly:
#include <iostream>
#include <valarray>
using namespace std;
using CMyVektor = valarray<double>;
CMyVektor gradient(CMyVektor x, double (*funktion)(CMyVektor x));
const double h = 0.00000001;
int main()
{
// sum(x_i^2 + x_i)--> gradient: 2*x_i + 1
auto fun = [](CMyVektor x) {return (x*x + x).sum();};
CMyVektor d = gradient(CMyVektor{1,2,3,4,5}, fun);
for (auto i: d) cout << i<<' ';
return 0;
}
CMyVektor gradient(CMyVektor x, double (*funktion)(CMyVektor x)){
CMyVektor grads(x.size());
CMyVektor pos(x.size());
for (int i = 0; i<x.size(); i++){
pos[i] = 1;
grads[i] = (funktion(x + h * pos) - funktion(x))/ h;
pos[i] = 0;
}
return grads;
}
This prints 3 5 7 9 11, which is what is expected for the given function at the given point.
Consider this struct which can for example represent a structure of 2 4D vectors:
struct A {
double x[4];
double y[4];
A() : A(0.0, 0.0) { }
A(double xp, double yp)
{
std::fill_n(x, 4, xp);
std::fill_n(y, 4, yp);
}
// Simple element-wise delegation of the mathematical operations
friend A operator+(const A &l, const A &r)
{
A res;
for (int i = 0; i < 4; i++)
{
res.x[i] = l.x[i] + r.x[i];
res.y[i] = l.y[i] + r.y[i];
}
return res;
}
friend A operator*(const A &l, const double &r)
{
A res;
for (int i = 0; i < 4; i++)
{
res.x[i] = l.x[i] * r;
res.y[i] = l.y[i] * r;
}
return res;
}
friend A operator*(const double &l, const A &r)
{
A res;
for (int i = 0; i < 4; i++)
{
res.x[i] = l * r.x[i];
res.y[i] = l * r.y[i];
}
return res;
}
friend std::ostream &operator<<(std::ostream &stream, const A &a)
{
for (int i = 0; i < 4; i++)
stream << "(" << a.x[i] << "|" << a.y[i] << ") "; // write to the stream that was passed in, not std::cout
return stream;
}
};
For convenience, the struct has a few operators defined, which simply delegate to member- and element-wise operations.
Now, consider two different versions of a second struct B, that contains objects of A:
struct B { // version 1
double f1;
double f2; // Two coefficients
A buff1;
A buff2;
A buffa[4]; // Objects of struct A
// The following functions use the operators defined on struct A
void mathA(int i, double d) // Some math operations
{
buff2 = buff1 + buffa[i] * d;
}
void mathB() // Some more math (vector) operations
{
buff1 = f1 * (buffa[0] + buffa[3]) + f2 * (buffa[1] + buffa[2]);
}
};
and
struct B { // version 2
double f1;
double f2; // Two coefficients
A buff1;
A buff2;
A buffa[4]; // Objects of struct A
// The following functions DO NOT use the operators defined on struct A
void mathA(int i, double d) // Some math operations
{
for (int j = 0; j < 4; j++)
{
buff2.x[j] = buff1.x[j] + buffa[i].x[j] * d;
buff2.y[j] = buff1.y[j] + buffa[i].y[j] * d;
}
}
void mathB() // Some more math (vector) operations
{
for (int j = 0; j < 4; j++)
{
buff1.x[j] = f1 * (buffa[0].x[j] + buffa[3].x[j]) + f2 * (buffa[1].x[j] + buffa[2].x[j]);
buff1.y[j] = f1 * (buffa[0].y[j] + buffa[3].y[j]) + f2 * (buffa[1].y[j] + buffa[2].y[j]);
}
}
};
As you can see, both versions of struct B perform the same mathematical operations, but the first version uses the operators of struct A, while the second performs these operations manually in mathA and mathB and does not use the operators defined in struct A at all.
Let's add a main function to test the functionality of struct B (Diff window, "Left"):
int main(int argc, char **argv)
{
B b;
b.f1 = 0.5;
b.f2 = 0.8;
b.buff1 = A(0.7, 0.8);
b.buff2 = A(1.7, 2.8);
b.mathA(1, 0.9);
b.mathB();
std::cout << b.buff1 << "\n" << b.buff2;
}
I have prepared examples of both cases in godbolt here. Both cases are compiled using g++ 7.1.0 on optimization level -O3. The left case corresponds to version 1, the right case to version 2 of struct B.
As you can see in the disassembly, the compiler generates two labels for version 1, which correspond to the mathX functions in struct B:
64 B::mathA(int, double):
[…]
76 B::mathB():
As my analysis shows, the first example is much slower compared to the second example. The functions are called more than 1 billion times in my actual code and are thus very much contributing to the overall runtime. I assume this is partially due to the jumps to the function definitions.
Is there a way to force the compiler to produce an assembly that is identical to the second example? I.e. with using the definitions of the operators?
Update
Since the compiler seemed to generate labels and jumps for mathX(…) my idea was to attempt to inline these functions. Using the inline keyword changed nothing, but for g++ you can use __attribute__((always_inline)) which will force the compiler to inline the function (documentation):
struct B { // version 3
// …
__attribute__((always_inline)) void mathA(int i, double d)
{
// …
}
__attribute__((always_inline)) void mathB()
{
// …
}
};
This improved the performance, which is now somewhere between version 1 and version 2. This is still not perfect, but if no better solution is found, I will go with this one.
I spent quite some time searching the internet for a solution to this. Maybe it is out there, but nothing I saw helped me.
I have a function
double integrand(double r, double phi, double theta)
I want to integrate it over all three dimensions with given definite bounds. I found several code snippets online that implement numerical schemes for single-variable definite integrals, so I thought to myself: "well, I'll just integrate along one dimension after the other".
Algorithmically speaking what I wanted to do was :
double firstIntegral(double r, double phi) {
double result = integrationFunction(integrand,lower_bound,upper_bound);
return result;
}
And simply do it again two more times. This is easy in languages like MATLAB, where I can create function handles anywhere, but I don't know how to do it in C++. I would first have to define a function that, for some given r and phi, computes integrand(r, phi, theta) for any theta, so that in C++ it becomes a function of one variable only, but I don't know how to do that.
How can I compute the triple integral of my three-variable function in C++ using a one-dimensional integration routine (or anything else, really)?
This is a very slow and inexact version for integrals over cartesian coordinates, which should work with C++11.
It is using std::function and lambdas to implement the numerical integration. No steps have been taken to optimize this.
A template based solution could be much faster (by several orders of magnitude) than this, because it may allow the compiler to inline and simplify some of the code.
#include<functional>
#include<iostream>
static double integrand(double /*x*/, double y, double /*z*/)
{
return y;
}
double integrate_1d(std::function<double(double)> const &func, double lower, double upper)
{
static const double increment = 0.001;
double integral = 0.0;
for(double x = lower; x < upper; x+=increment) {
integral += func(x) * increment;
}
return integral;
}
double integrate_2d(std::function<double(double, double)> const &func, double lower1, double upper1, double lower2, double upper2)
{
static const double increment = 0.001;
double integral = 0.0;
for(double x = lower2; x < upper2; x+=increment) {
auto func_x = [=](double y){ return func(x, y);};
integral += integrate_1d(func_x, lower1, upper1) * increment;
}
return integral;
}
double integrate_3d(std::function<double(double, double, double)> const &func,
double lower1, double upper1,
double lower2, double upper2,
double lower3, double upper3)
{
static const double increment = 0.001;
double integral = 0.0;
for(double x = lower3; x < upper3; x+=increment) {
auto func_x = [=](double y, double z){ return func(x, y, z);};
integral += integrate_2d(func_x, lower1, upper1, lower2, upper2) * increment;
}
return integral;
}
int main()
{
double integral = integrate_3d(integrand, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0);
std::cout << "Triple integral: " << integral << std::endl;
return 0;
}
You can use functors:
#include <iostream>
struct MyFunctorMultiply
{
double m_coeff;
MyFunctorMultiply(double coeff)
{
m_coeff = coeff;
}
double operator()(double value)
{
return m_coeff * value;
}
};
struct MyFunctorAdd
{
double m_a;
MyFunctorAdd(double a)
{
m_a = a;
}
double operator()(double value)
{
return m_a + value;
}
};
template<class t_functor>
double calculate(t_functor functor, double value, double other_param)
{
return functor(value) - other_param;
}
int main()
{
MyFunctorMultiply multiply2(2.);
MyFunctorAdd add3(3.);
double result_a = calculate(multiply2, 4, 1); // should obtain 4 * 2 - 1 = 7
double result_b = calculate(add3, 5, 6); // should obtain 5 + 3 - 6 = 2
std::cout << result_a << std::endl;
std::cout << result_b << std::endl;
}
If your concern is just about getting the right prototype to pass to the integration function, you can very well use alternative data-passing mechanisms, the simplest of which is using global variables.
Assuming that the order of integration is on theta, then phi, then r, write three functions of a single argument:
It(theta) computes the integrand from the argument theta passed explicitly and the global phi and r.
Ip(phi) computes the bounds on theta from the argument phi passed explicitly and the global r; it also copies the phi argument to the global variable and invokes integrationFunction(It, lower_t, upper_t).
Ir(r) computes the bounds on phi from the argument r passed explicitly; it also copies the r argument to the global variable and invokes integrationFunction(Ip, lower_p, upper_p).
Now you are ready to call integrationFunction(Ir, lower_r, upper_r).
It may also be that integrationFunction supports a "context" argument where you can store what you want.
I am trying to calculate complex numbers for a 2D array in C++. The code is running very slowly and I have narrowed down the main cause to be the exp function (the program runs quickly when I comment out that line, even though I have 4 nested loops).
int main() {
typedef vector< complex<double> > complexVect;
typedef vector<double> doubleVect;
const int SIZE = 256;
vector<doubleVect> phi_w(SIZE, doubleVect(SIZE));
vector<complexVect> phi_k(SIZE, complexVect(SIZE));
complex<double> i (0, 1), cmplx (0, 0);
complex<double> temp;
int x, y, t, k, w;
double dk = 2.0*M_PI / (SIZE-1);
double dt = M_PI / (SIZE-1);
int xPos, yPos;
double arg, arg2, arg4;
complex<double> arg3;
double angle;
vector<complexVect> newImg(SIZE, complexVect(SIZE));
for (x = 0; x < SIZE; ++x) {
xPos = -127 + x;
for (y = 0; y < SIZE; ++y) {
yPos = -127 + y;
for (t = 0; t < SIZE; ++t) {
temp = cmplx;
angle = dt * t;
arg = xPos * cos(angle) + yPos * sin(angle);
for (k = 0; k < SIZE; ++k) {
arg2 = -M_PI + dk*k;
arg3 = exp(-i * arg * arg2);
arg4 = abs(arg) * M_PI / (abs(arg) + M_PI);
temp = temp + arg4 * arg3 * phi_k[k][t];
}
}
newImg[y][x] = temp;
}
}
}
Is there a way I can improve computation time? I have tried using the following helper function but it doesn't noticeably help.
complex<double> complexexp(double arg) {
// Euler's formula: exp(i*arg) = cos(arg) + i*sin(arg)
complex<double> temp (cos(arg), sin(arg));
return temp;
}
I am using clang++ to compile my code.
Edit: I think the problem is the fact that I'm trying to calculate complex numbers. Would it be faster if I just used Euler's formula to calculate the real and imaginary parts in separate arrays, so that I don't have to deal with the complex class?
Maybe this will work for you:
http://martin.ankerl.com/2007/02/11/optimized-exponential-functions-for-java/
I've had a look with callgrind. The only marginal improvement (~1.3% with size = 50) I could find was to change:
temp = temp + arg4 * arg3 * phi_k[k][t];
to
temp += arg4 * arg3 * phi_k[k][t];
The most costly function calls were sin()/cos(). I suspect that calling exp() with a complex number argument calls those functions in the background.
To retain precision, the function will compute very slowly, and there doesn't seem to be a way around it. However, you could trade precision for speed, which seems to be what game developers do: sin and cos are slow, is there an alternative?
You can define the number e as a constant and use the std::pow() function.
Quick question related to IIR filter coefficients. Here is a very typical implementation of a direct form II biquad IIR processor that I found online.
// b0, b1, b2, a1, a2 are filter coefficients
// m1, m2 are the memory locations
// dn is the de-denormal coeff (=1.0e-20f)
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
register float w = in[i] - a1*m1 - a2*m2 + dn;
out[i] = b1*m1 + b2*m2 + b0*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
I understand that the "register" is somewhat unnecessary given how smart modern compilers are about this kind of thing. My question is, are there any potential performance benefits to storing the filter coefficients in individual variables rather than using arrays and dereferencing the values? Would the answer to this question depend on the target platform?
i.e.
out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;
versus
out[i] = b1*m1 + b2*m2 + b0*w;
It really depends on your compiler and the optimization options. Here is my take:
Any modern compiler would just ignore register. It is just a hint to the compiler and modern ones just don't use it.
Accessing constant indexes in a loop is usually optimized away when compiling with optimization on. In a sense, using variables or an array as you showed makes no difference.
Always, always run benchmarks and look at the generated code for performance critical sections of the code.
EDIT: OK, just out of curiosity I wrote a small program and got "identical" code generated when using full optimization with VS2010. Here is what I get inside the loop for the expression in question (exactly identical for both cases):
0128138D fmul dword ptr [eax+0Ch]
01281390 faddp st(1),st
01281392 fld dword ptr [eax+10h]
01281395 fld dword ptr [w]
01281398 fld st(0)
0128139A fmulp st(2),st
0128139C fxch st(2)
0128139E faddp st(1),st
012813A0 fstp dword ptr [ecx+8]
Notice that I added a few lines to output the results, to make sure the compiler does not just optimize everything away. Here is the code:
#include <iostream>
#include <iterator>
#include <algorithm>
class test1
{
float a1, a2, b0, b1, b2;
float dn;
float m1, m2;
public:
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
float w = in[i] - a1*m1 - a2*m2 + dn;
out[i] = b1*m1 + b2*m2 + b0*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
};
class test2
{
float a[2], b[3];
float dn;
float m1, m2;
public:
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
float w = in[i] - a[0]*m1 - a[1]*m2 + dn;
out[i] = b[0]*m1 + b[1]*m2 + b[2]*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
test1 t1;
test2 t2;
float a[1000];
float b[1000];
t1.processBiquad(a, b, 1000);
t2.processBiquad(a, b, 1000);
std::copy(b, b+1000, std::ostream_iterator<float>(std::cout, " "));
return 0;
}
I am not sure, but this :
out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;
might be worse, because it would compile to indirect access, which is worse than direct access performance-wise.
The only way to actually find out is to check the compiled assembly and profile the code.
You will likely get a benefit if you can declare the coefficients b0, b1, b2 as const. Code will be more efficient if any of your operands are known and fixed at compile time.