I have the following loop for a monte-carlo computation I am performing:
the variables below are pre-computed/populated and is defined as:
w_ = std::vector<std::vector<double>>(150000, std::vector<double>(800));
C_ = Eigen::MatrixXd(800,800);
Eigen::VectorXd a(800);
Eigen::VectorXd b(800);
The while loop is taking me about 570 seconds to compute.Just going by the the loops I understand that I have nPaths*m = 150,000 * 800 = 120,000,000 sets of computations happening (I have not taken into account the cdf computations handled by boost libraries).
I am a below average programmer and was wondering if there are any obvious mistakes which I am making which maybe slowing the computation down. Or is there any other way to handle the computation which can speed things up.
int N(0);
int nPaths(150000);
int m(800);
double Varsum(0.);
double err;
double delta;
double v1, v2, v3, v4;
Eigen::VectorXd d = Eigen::VectorXd::Zero(m);
Eigen::VectorXd e = Eigen::VectorXd::Zero(m);
Eigen::VectorXd f = Eigen::VectorXd::Zero(m);
Eigen::VectorXd y;
y0 = Eigen::VectorXd::Zero(m);
boost::math::normal G(0, 1.);
d(0) = boost::math::cdf(G, a(0) / C_(0, 0));
e(0) = boost::math::cdf(G, b(0) / C_(0, 0));
f(0) = e(0) - d(0);
while (N < (nPaths-1))
{
y = y0;
for (int i = 1; i < m; i++)
{
v1 = d(i - 1) + w_[N][(i - 1)]*(e(i - 1) - d(i - 1));
y(i - 1) = boost::math::quantile(G, v1);
v2 = (a(i) - C_.row(i).dot(y)) / C_(i, i);
v3 = (b(i) - C_.row(i).dot(y)) / C_(i, i);
d(i) = boost::math::cdf(G, v2);
e(i) = boost::math::cdf(G, v3);
f(i) = (e(i) - d(i))*f(i - 1);
}
N++;
delta = (f(m-1) - Intsum) / N;
Intsum += delta;
Varsum = (N - 2)*Varsum / N + delta*delta;
err = alpha_*std::sqrt(Varsum);
}
If I understand your code right, the running time is actually O(nPaths*m*m)=10^11, due to the dot-product C_.row(i).dot(y) which needs O(m) operation.
You could speed up the program by factor of two by not calculating it twice:
double prod=C_.row(i).dot(y)
v2 = (a(i) - prod) / C_(i, i);
v3 = (b(i) - prod) / C_(i, i);
but maybe compiler already does it for you.
The other thing is that y consists of zeros (at least at the beginning) so you don't have to do the full dot-product but only until current value of i. That should give another factor 2 speed up.
So taken into the account the sheer number of operation your timings are not so bad. There is some room for improvement of the code, but if you are interested in speeding up some orders of magnitude you probably should be thinking about changing your formulation.
Related
Well, I had task to create function that does Fourier series with some mathematical function, so I found all the formulas, but the main problem is when I change count of point on some interval to draw those series I have very strange artifact:
This is Fourier series of sin(x) on interavl (-3.14; 314) with 100 point for tabulation
And this is same function with same interval but with 100000 points for tabulation
Code for Fourier series coeficients:
void fourieSeriesDecompose(std::function<double(double)> func, double period, long int iterations, double *&aParams, double *&bParams){
aParams = new double[iterations];
aParams[0] = integrateRiemans(func, 0, period, 1000);
for(int i = 1; i < iterations; i++){
auto sineFunc = [&](double x) -> double { return 2 * (func(x) * cos((2 * x * i * M_PI) / period)); };
aParams[i] = integrateRiemans(sineFunc, -period / 2, period / 2, 1000) / period;
}
bParams = new double[iterations];
for(int i = 1; i < iterations; i++){
auto sineFunc = [&](double x) -> double { return 2 * (func(x) * sin(2 * (x * (i + 1) * M_PI) / period)); };
bParams[i] = integrateRiemans(sineFunc, -period / 2, period / 2, 1000) / period;
}
}
This code I use to reproduce function using found coeficients:
double fourieSeriesCompose(double x, double period, long iterations, double *aParams, double *bParams){
double y = aParams[0];
for(int i = 1; i < iterations; i++){
y += sqrt(aParams[i] * aParams[i] + bParams[i] * bParams[i]) * cos((2 * i * x * M_PI) / period - atan(bParams[i] / aParams[i]));
}
return y;
}
And the runner code
double period = M_PI * 2;
auto startFunc = [](double x) -> double{ return sin(x); };
fourieSeriesDecompose(*startFunc, period, 1000, aCoeficients, bCoeficients);
auto readyFunc = [&](double x) -> double{ return fourieSeriesCompose(x, period, 1000, aCoeficients, bCoeficients); };
tabulateFunc(readyFunc);
scaleFunc();
//Draw methods after this
see:
How to compute Discrete Fourier Transform?
So if I deciphered it correctly the aParams,bParams represent the real and imaginary part of the result then the angles in sin and cos must be the same but you have different! You got this:
auto sineFunc = [&](double x) -> double { return 2*(func(x)*cos((2* x* i *M_PI)/period));
auto sineFunc = [&](double x) -> double { return 2*(func(x)*sin( 2*(x*(i+1)*M_PI)/period));
as you can see its not the same angle. Also what is period? You got iterations! if it is period of the function you want to transform then it should be applied to it and not to the kernel ... Also integrateRiemans does what? its the nested for loop to integrate the furrier transform? Btw. hope that func is real domain otherwise the integration/sumation needs both real and imaginary part not just one ...
So what you should do is:
create (cplx) table of the func(x) data on the interval you want with iterations samples
so for loop where x = x0+i*(x1-x0)/(iterations-1) and x0,x1 is the range you want the func to sample. Lets call it f[i]
for (i=0;i<iteration;i++) f[i]=func(x0+i*(x1-x0)/(iterations-1));
furrier transform it
something like this:
for (i=0;i<iteration;i++) a[i]=b[i]=0;
for (j=0;j<iteration;j++)
for (i=0;i<iteration;i++)
{
a[j]+=f[i]*cos(-2.0*M_PI*i*j/iterations);
b[j]+=f[i]*sin(-2.0*M_PI*i*j/iterations);
}
now a[],b[] should hold your slow DFT result ... beware integer rounding ... depending on compiler you might need to cast some stuff to double to avoid integer rounding.
I am trying to replicate Matlab's Fsolve as my project is in C++ solving an implicit RK4 scheme. I am using the NLopt library using the NLOPT_LD_MMA algorithm. I have run the required section in matlab and it is considerably faster. I was wondering whether anyone had any ideas of a better Fsolve equivalent in C++? Another reason is that I would like f1 and f2 to both tend to zero and it seems suboptimal to calculate the L2 norm to include both of them as NLopt seems to only allow a scalar return value from the objective function. Does anyone have any ideas of an alternative library or perhaps using a different algorithm/constraints to more closely replicate the default fsolve.
Would it be better (faster) perhaps to call the python scipy.minimise.fsolve from C++?
double implicitRK4(double time, double V, double dt, double I, double O, double C, double R){
const int number_of_parameters = 2;
double lb[number_of_parameters];
double ub[number_of_parameters];
lb[0] = -999; // k1 lb
lb[1] = -999;// k2 lb
ub[0] = 999; // k1 ub
ub[1] = 999; // k2 ub
double k [number_of_parameters];
k[0] = 0.01;
k[1] = 0.01;
kOptData addData(time,V,dt,I,O,C,R);
nlopt_opt opt; //NLOPT_LN_MMA NLOPT_LN_COBYLA
opt = nlopt_create(NLOPT_LD_MMA, number_of_parameters);
nlopt_set_lower_bounds(opt, lb);
nlopt_set_upper_bounds(opt, ub);
nlopt_result nlopt_remove_inequality_constraints(nlopt_opt opt);
// nlopt_result nlopt_remove_equality_constraints(nlopt_opt opt);
nlopt_set_min_objective(opt,solveKs,&addData);
double minf;
if (nlopt_optimize(opt, k, &minf) < 0) {
printf("nlopt failed!\n");
}
else {
printf("found minimum at f(%g,%g,%g) = %0.10g\n", k[0],k[1],minf);
}
nlopt_destroy(opt);
return V + (1/2)*dt*k[0] + (1/2)*dt*k[1];```
double solveKs(unsigned n, const double *x, double *grad, void *my_func_data){
kOptData *unpackdata = (kOptData*) my_func_data;
double t1,y1,t2,y2;
double f1,f2;
t1 = unpackdata->time + ((1/2)-(1/6)*sqrt(3));
y1 = unpackdata->V + (1/4)*unpackdata->dt*x[0] + ((1/4)-(1/6)*sqrt(3))*unpackdata->dt*x[1];
t2 = unpackdata->time + ((1/2)+(1/6)*sqrt(3));
y2 = unpackdata->V + ((1/4)+(1/6)*sqrt(3))*unpackdata->dt*x[0] + (1/4)*unpackdata->dt*x[1];
f1 = x[0] - stateDeriv_implicit(t1,y1,unpackdata->dt,unpackdata->I,unpackdata->O,unpackdata->C,unpackdata->R);
f2 = x[1] - stateDeriv_implicit(t2,y2,unpackdata->dt,unpackdata->I,unpackdata->O,unpackdata->C,unpackdata->R);
return sqrt(pow(f1,2) + pow(f2,2));
My matlab version below seems to be a lot simpler but I would prefer the whole code in c++!
k1 = 0.01;
k2 = 0.01;
x0 = [k1,k2];
fun = #(x)solveKs(x,t,z,h,I,OCV1,Cap,Rct,static);
options = optimoptions('fsolve','Display','none');
k = fsolve(fun,x0,options);
% Calculate the next state vector from the previous one using RungeKutta
% update equation
znext = z + (1/2)*h*k(1) + (1/2)*h*k(2);``
function [F] = solveKs(x,t,z,h,I,O,C,R,static)
t1 = t + ((1/2)-(1/6)*sqrt(3));
y1 = z + (1/4)*h*x(1) + ((1/4)-(1/6)*sqrt(3))*h *x(2);
t2 = t + ((1/2)+(1/6)*sqrt(3));
y2 = z + ((1/4)+(1/6)*sqrt(3))*h*x(1) + (1/4)*h*x(2);
F(1) = x(1) - stateDeriv_implicit(t1,y1,h,I,O,C,R,static);
F(2) = x(2) - stateDeriv_implicit(t2,y2,h,I,O,C,R,static);
end
I am developing a package in R that I would like to convert to Rcpp for better performance. I'm new to Rcpp (and C++ in general.) My problem is that the Rcpp function I've written works fine if I run it many times with one set of arguments, but if I try to loop it over many combinations of arguments, it springs memory leaks and causes the R session to abort.
Here is the code in R, which holds up well to any test I throw at it:
raw_noise <- function(timesteps, mu, sigma, phi) {
delta <- mu * (1 - phi)
variance <- sigma^2 * (1 - phi^2)
noise <- vector(mode = "double", length = timesteps)
noise[1] <- c(rnorm(1, mu, sigma))
for (i in (1:(timesteps - 1))) {
noise[i + 1] <- delta + phi * noise[i] + rnorm(1, 0, sqrt(variance))
}
return(noise)
}
Here is the code in Rcpp, using three Rcpp sugar functions (pow, sqrt, rnorm):
NumericVector raw_noise(int timesteps, double mu, double sigma, double phi) {
double delta = mu * (1 - phi);
double variance = pow(sigma, 2.0) * (1 - pow(phi, 2.0));
NumericVector noise(timesteps);
noise[0] = R::rnorm(mu, sigma);
for(int i = 0; i < timesteps; ++i) {
noise[i+1] = delta + phi*noise[i] + R::rnorm(0, sqrt(variance));
}
return noise;
}
What really confuses me is that this code runs without problems:
library(purrr)
rerun(10000, raw_noise(timesteps = 30, mu = 0.5, sigma = 0.2, phi = 0.3))
But when I run this code:
test_loop <- function(timesteps, mu, sigma, phi, replicates) {
params <- cross_df(list(timesteps = timesteps, phi = phi, mu = mu, sigma =
sigma))
for (i in 1:nrow(params)) {
print(params[i,])
pmap(params[i,], raw_noise)
}
}
library(purrr)
test_loop(timesteps=c(5, 6, 7, 8, 9, 10), mu=c(0.2, 0.5), sigma=c(0.2, 0.5),
phi=c(0, 0.1))
More often than not, the R session aborts and RStudio crashes altogether. But sometimes I manage to catch this error message before the R session aborts:
Error in match(x, table, nomatch = 0L) : GC encountered a node
(0x10db7af50) with an unknown SEXP type: NEWSXP at memory.c:1692
As I understand it, NEWSXP is an exotic object type in R that doesn't come up very often. What's happening looks to me like a memory leak, but I'm not at all sure how to fix it. Like I said, I'm new to Rcpp and C++ generally so I'd appreciate any nudges in the right direction.
You have an out of bounds error:
for(int i = 0; i < timesteps; ++i)
causes
noise[i+1]
to exceed the defined range since C++ indices start at 0 and not 1.
For example, 0 to timesteps - 1 has a length of timesteps and, thus, is okay.
but
0 to timesteps would have a length of timesteps + 1
This can be seen if you change noise[i+1] to noise(i+1), which performs a bounds check on the requested index.
Error in raw_noise(100, 2, 3, 0.2) :
Index out of bounds: [index=100; extent=100].
To address this, make the following change:
NumericVector raw_noise(int timesteps, double mu, double sigma, double phi) {
double delta = mu * (1 - phi);
double variance = pow(sigma, 2.0) * (1 - pow(phi, 2.0));
NumericVector noise(timesteps);
noise[0] = R::rnorm(mu, sigma);
// change here
for(int i = 0; i < timesteps - 1; ++i) { // 1 less time step
noise[i+1] = delta + phi*noise[i] + R::rnorm(0, sqrt(variance));
}
return noise;
}
I'm trying to implement batch grandient descent algorithm for my machine learning homework. I have a training set, whose x value is around 10^3 and y value is around 10^6. I'm trying to find the value of [theta0, theta1] which makes y = theta0 + theta1 * x converge. I set the learning rate to 0.0001 and maximum interation to 10. Here's my code in Qt.
QVector<double> gradient_descent_batch(QVector<double> x, QVector<double>y)
{
QVector<double> theta(0);
theta.resize(2);
int size = x.size();
theta[1] = 0.1;
theta[0] = 0.1;
for (int j=0;j<MAX_ITERATION;j++)
{
double dJ0 = 0.0;
double dJ1 = 0.0;
for (int i=0;i<size;i++)
{
dJ0 += (theta[0] + theta[1] * x[i] - y[i]);
dJ1 += (theta[0] + theta[1] * x[i] - y[i]) * x[i];
}
double theta0 = theta[0];
double theta1 = theta[1];
theta[0] = theta0 - LRATE * dJ0;
theta[1] = theta1 - LRATE * dJ1;
if (qAbs(theta0 - theta[0]) < THRESHOLD && qAbs(theta1 - theta[1]) < THRESHOLD)
return theta;
}
return theta;
}
I print the value of theta every interation, and here's the result.
QVector(921495, 2.29367e+09)
QVector(-8.14503e+12, -1.99708e+16)
QVector(7.09179e+19, 1.73884e+23)
QVector(-6.17475e+26, -1.51399e+30)
QVector(5.3763e+33, 1.31821e+37)
QVector(-4.68109e+40, -1.14775e+44)
QVector(4.07577e+47, 9.99338e+50)
QVector(-3.54873e+54, -8.70114e+57)
QVector(3.08985e+61, 7.57599e+64)
QVector(-2.6903e+68, -6.59634e+71)
I seems that theta will never converge.
I follow the solution here to set learning rate to 0.00000000000001 and maximum iteration to 20. But it seems will not converge. Here's the result.
QVector(0.100092, 0.329367)
QVector(0.100184, 0.558535)
QVector(0.100276, 0.787503)
QVector(0.100368, 1.01627)
QVector(0.10046, 1.24484)
QVector(0.100552, 1.47321)
QVector(0.100643, 1.70138)
QVector(0.100735, 1.92936)
QVector(0.100826, 2.15713)
QVector(0.100918, 2.38471)
QVector(0.101009, 2.61209)
QVector(0.1011, 2.83927)
QVector(0.101192, 3.06625)
QVector(0.101283, 3.29303)
QVector(0.101374, 3.51962)
QVector(0.101465, 3.74601)
QVector(0.101556, 3.9722)
QVector(0.101646, 4.1982)
QVector(0.101737, 4.424)
QVector(0.101828, 4.6496)
What's wrong?
So firstly your algorithm seems fine except that you should divide LRATE by size;
theta[0] = theta0 - LRATE * dJ0 / size;
theta[1] = theta1 - LRATE * dJ1 / size;
What I would suggest you should calculate cost function and monitor it;
Cost function
Your cost should be decreasing on every iteration. If its bouncing back and forward you are using a large value of learning rate. I would suggest you to use 0.01 and do 400 iterations.
I'm working on a school project , I have to optimize part of code in SSE, but I'm stuck on one part for few days now.
I dont see any smart way of using vector SSE instructions(inline assembler / instric f) in this code(its a part of guassian blur algorithm). I would be glad if somebody could give me just a small hint
for (int x = x_start; x < x_end; ++x) // vertical blur...
{
float sum = image[x + (y_start - radius - 1)*image_w];
float dif = -sum;
for (int y = y_start - 2*radius - 1; y < y_end; ++y)
{ // inner vertical Radius loop
float p = (float)image[x + (y + radius)*image_w]; // next pixel
buffer[y + radius] = p; // buffer pixel
sum += dif + fRadius*p;
dif += p; // accumulate pixel blur
if (y >= y_start)
{
float s = 0, w = 0; // border blur correction
sum -= buffer[y - radius - 1]*fRadius; // addition for fraction blur
dif += buffer[y - radius] - 2*buffer[y]; // sum up differences: +1, -2, +1
// cut off accumulated blur area of pixel beyond the border
// assume: added pixel values beyond border = value at border
p = (float)(radius - y); // top part to cut off
if (p > 0)
{
p = p*(p-1)/2 + fRadius*p;
s += buffer[0]*p;
w += p;
}
p = (float)(y + radius - image_h + 1); // bottom part to cut off
if (p > 0)
{
p = p*(p-1)/2 + fRadius*p;
s += buffer[image_h - 1]*p;
w += p;
}
new_image[x + y*image_w] = (unsigned char)((sum - s)/(weight - w)); // set blurred pixel
}
else if (y + radius >= y_start)
{
dif -= 2*buffer[y];
}
} // for y
} // for x
One more feature you can use is logical operations and masks:
for example instead of:
// process only 1
if (p > 0)
p = p*(p-1)/2 + fRadius*p;
you can write
// processes 4 floats
const __m128 &mask = _mm_cmplt_ps(p,0);
const __m128 ¬Mask = _mm_cmplt_ps(0,p);
const __m128 &p_tmp = ( p*(p-1)/2 + fRadius*p );
p = _mm_add_ps(_mm_and_ps(p_tmp, mask), _mm_and_ps(p, notMask)); // = p_tmp & mask + p & !mask
Also I can recommend you to use a special libraries, which overloads instructions. For example: http://code.compeng.uni-frankfurt.de/projects/vc
dif variable makes iterations of inner loop dependent. You should try to parallelize the outer loop. But with out instructions overloading the code will become unmanageable then.
Also consider rethinking the whole algorithm. Current one doesn't look paralell. May be you can neglect precision, or increase scalar time a bit?