Gradient descent algorithm won't converge - c++

I'm trying to write out a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2 at around 25:00). Below is the implementation I used at first, and I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (>8) to the training set.
I'm inputting a number X, and the point (X,X) is added to the training set, so at the moment, I'm only trying to get it to converge to y=ax+b where a=1=theta[1] and b=0=theta[0].
The training set is the array x and y, where (x[i],y[i]) is a point.
void train()
{
    double delta;
    for (int i = 0; i < x.size(); i++)
    {
        delta = y[i]-hypothesis(x[i]);
        theta[1] += alpha*delta*x[i];
        theta[0] += alpha*delta*1;
    }
}
void C_Approx::display()
{
    std::cout<<theta[1]<<"x + "<<theta[0]<<" \t "<<"f(x)="<<hypothesis(1)<<std::endl;
}
some of the results I'm getting:
I input a number, it runs train() a few times, then display()
1
0.33616x + 0.33616 f(x)=0.67232
1
0.482408x + 0.482408 f(x)=0.964816
1
0.499381x + 0.499381 f(x)=0.998762
1
0.499993x + 0.499993 f(x)=0.999986
1
0.5x + 0.5 f(x)=1
An example of it diverging after it passed 8:
1
0.33616x + 0.33616 f(x)=0.67232
2
0.705508x + 0.509914 f(x)=1.21542
3
0.850024x + 0.449928 f(x)=1.29995
4
0.936062x + 0.330346 f(x)=1.26641
5
0.951346x + 0.231295 f(x)=1.18264
6
0.992876x + 0.137739 f(x)=1.13062
7
0.932206x + 0.127372 f(x)=1.05958
8
1.00077x + 0.000493063 f(x)=1.00126
9
-0.689325x + -0.0714712 f(x)=-0.760797
10
4.10321e+08x + 4.365e+07 f(x)=4.53971e+08
11
1.79968e+22x + 1.61125e+21 f(x)=1.9608e+22
12
-3.9452e+41x + -3.26957e+40 f(x)=-4.27216e+41
I tried the solution proposed here of scaling the step and ended up with similar results.
What am I doing wrong?

Your implementation is good. Generally, stochastic gradient descent might diverge when α is too large. What you would do with a large dataset is take a reasonably sized random sample, find α that gives you the best results, and then use it for the rest.
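As a rough illustration of why the step size matters here, the sketch below reruns the question's per-sample update in a self-contained form. For a single training point (x, y), one update multiplies that point's error by (1 - alpha*(x^2 + 1)), so the error grows whenever alpha exceeds 2/(x^2 + 1). The names sgd_fit, max_stable_alpha and Line are made up for this example and are not from the question's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fitted line y = a*x + b (a is theta[1], b is theta[0] in the question).
struct Line {
    double a;
    double b;
};

// Repeats the question's per-sample update for `passes` sweeps over the
// training points (xs[i], ys[i]), starting from a = b = 0.
Line sgd_fit(const std::vector<double>& xs, const std::vector<double>& ys,
             double alpha, int passes) {
    assert(xs.size() == ys.size());
    Line t{0.0, 0.0};
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < xs.size(); ++i) {
            const double delta = ys[i] - (t.a * xs[i] + t.b);
            t.a += alpha * delta * xs[i];
            t.b += alpha * delta;
        }
    }
    return t;
}

// Upper bound on alpha below which no single update overshoots so far that
// it increases the error of the point it was computed from.
double max_stable_alpha(const std::vector<double>& xs) {
    double worst = 1.0;  // x*x + 1 is at least 1
    for (double x : xs) worst = std::max(worst, x * x + 1.0);
    return 2.0 / worst;
}
```

For the point (10, 10) the bound is 2/101, roughly 0.0198. The first outputs in the question (0.33616, 0.482408, 0.499381) match (1 - 0.8^k)/2 for k = 5, 15 and 30, which would correspond to alpha = 0.1 and five calls to train() per input; that is an inference from the printed numbers, not something the question states. With alpha = 0.1, any point with x above about 4.36 overshoots on its own update.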

I have experienced the same problem (albeit in Java) because my learning rate was too big.
In short, I was using α = 0.001 and I had to reduce it to 0.000001 to see actual convergence.
Of course these values are linked to your dataset.

When your cost function increases or cycles up and down, you usually have too large a value for alpha. What alpha are you using?
Start out with an alpha = 0.001 and see if that converges? If not try various alphas (0.003, 0.01, 0.03, 0.1, 0.3, 1) and find one that converges quickly.
Scaling the data (normalization) won't help you with only 1 feature (your theta[1]) as normalization only applies to 2+ features (multivariate linear regression).
Also bear in mind that for a small number of features you can use the Normal Equation to get the correct answer.
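For the one-feature case in the question, the Normal Equation reduces to the familiar closed-form formulas for slope and intercept, so no step size is involved. A minimal sketch, with least_squares_line and Fit as illustrative names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Least-squares line y = slope*x + intercept.
struct Fit {
    double slope;
    double intercept;
};

// Closed-form solution of the normal equations for a single feature plus an
// intercept. Requires at least two points with distinct x values.
Fit least_squares_line(const std::vector<double>& xs,
                       const std::vector<double>& ys) {
    assert(xs.size() == ys.size() && xs.size() >= 2);
    const double n = static_cast<double>(xs.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        sx += xs[i];
        sy += ys[i];
        sxx += xs[i] * xs[i];
        sxy += xs[i] * ys[i];
    }
    const double denom = n * sxx - sx * sx;  // zero only if all x are equal
    assert(denom != 0.0);
    const double slope = (n * sxy - sx * sy) / denom;
    const double intercept = (sy - slope * sx) / n;
    return Fit{slope, intercept};
}
```

For training points of the form (X, X), as in the question, this returns slope 1 and intercept 0 directly.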

Use backtracking line search to guarantee convergence. It is very simple to implement. See Stephen Boyd, Convex Optimization for reference. You can choose some standard alpha, beta values for backtracking line search, for example 0.3 and 0.8.
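A minimal sketch of gradient descent with backtracking line search in one dimension, using the 0.3 and 0.8 values mentioned above as the sufficient-decrease and shrink parameters. The function name, the tolerance and the iteration cap are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Minimizes a differentiable 1-D function f with gradient grad, starting
// from x. Each iteration starts with step t = 1 and shrinks it by beta until
// the sufficient-decrease condition f(x - t*g) <= f(x) - alpha*t*g*g holds.
double minimize_backtracking(const std::function<double(double)>& f,
                             const std::function<double(double)>& grad,
                             double x, double alpha = 0.3, double beta = 0.8,
                             double tol = 1e-8, int max_iters = 1000) {
    assert(alpha > 0.0 && alpha < 0.5 && beta > 0.0 && beta < 1.0);
    for (int i = 0; i < max_iters; ++i) {
        const double g = grad(x);
        if (std::fabs(g) < tol) break;  // gradient small enough: stop
        double t = 1.0;
        while (f(x - t * g) > f(x) - alpha * t * g * g) t *= beta;
        x -= t * g;
    }
    return x;
}
```

Because the step is re-derived from the function values at every iteration, there is no fixed learning rate to tune; on a steep quadratic such as f(x) = 20*x*x (a made-up test objective) the search simply settles on a shorter step.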

If I understand you correctly, your training set only has a non-zero gradient at the edge of a line? Unless you start at the line (actually start exactly at one of your training points) you won't find the line. You are always at a local minimum.

It's not clear from your description what problem you're solving.
Also, it's very dangerous to post links to external resources: you can be blocked on Stack Overflow.
In any case, the gradient descent method (and the subgradient method too) with a fixed step size (the ML community calls it the learning rate) does not necessarily converge.
P.S.
The machine learning community is not interested in "convergence conditions" and "convergence to what"; they are interested in creating "something" that passes cross-validation with a good result.
If you're curious about optimization, start looking into convex optimization. Unfortunately it's hard to find a job in it, but it gives a clear view of what happens in various mathematical optimization methods.
Here is source code that demonstrates this for a simple quadratic objective:
#!/usr/bin/env python
# Gradient descent with a fixed step size (no step-size rule) does not
# necessarily converge, even for a convex objective
alpha = 0.1
#k = 10.0 # jumps around the minimum
k = 20.0 # diverges
#k = 0.001 # moves so slowly that the gap to the optimum is still large at the end
def f(x): return k*x*x
def g(x): return 2*k*x
x0 = 12
xNext = x0
i = 0
threshold = 0.01
while True:
    i += 1
    xNext = xNext + alpha*(-1)*(g(xNext))
    obj = f(xNext)  # objective value at the current iterate
    print("Iteration: %i, Iterate: %f, Objective: %f, Optimality Gap: %f" % (i, xNext, obj, obj - f(0.0)))
    if (abs(g(xNext)) < threshold):
        break
    if i > 50:
        break
print("\nYou launched application with x0=%f,threshold=%f" % (x0, threshold))

Related

Using gradient descent to solve a nonlinear system

I have the following code, which uses gradient descent to find the global minimum of y = (x+5)^2:
cur_x = 3 # the algorithm starts at x=3
rate = 0.01 # learning rate
precision = 0.000001 # this tells us when to stop the algorithm
previous_step_size = 1
max_iters = 10000 # maximum number of iterations
iters = 0 # iteration counter
df = lambda x: 2*(x+5) # gradient of our function
while previous_step_size > precision and iters < max_iters:
    prev_x = cur_x # store current x value in prev_x
    cur_x = cur_x - rate * df(prev_x) # grad descent
    previous_step_size = abs(cur_x - prev_x) # change in x
    iters = iters+1 # iteration count
    print("Iteration",iters,"\nX value is",cur_x) # print iterations
print("The local minimum occurs at", cur_x)
The procedure is fairly simple, and among the most intuitive and brief for solving such a problem (at least, that I'm aware of).
I'd now like to apply this to solving a system of nonlinear equations. Namely, I want to use this to solve the Time Difference of Arrival problem in three dimensions. That is, given the coordinates of 4 observers (or, in general, n+1 observers for an n-dimensional solution), the velocity v of some signal, and the time of arrival at each observer, I want to reconstruct the source (determine its coordinates [x,y,z]).
I've already accomplished this using approximation search (see this excellent post on the matter: ), and I'd now like to try doing so with gradient descent (really, just as an interesting exercise). I know that the problem in two dimensions can be described by the following non-linear system:
\sqrt{(x-x_1)^2+(y-y_1)^2}+s(t_2-t_1) = \sqrt{(x-x_2)^2 + (y-y_2)^2}
\sqrt{(x-x_2)^2+(y-y_2)^2}+s(t_3-t_2) = \sqrt{(x-x_3)^2 + (y-y_3)^2}
\sqrt{(x-x_3)^2+(y-y_3)^2}+s(t_1-t_3) = \sqrt{(x-x_1)^2 + (y-y_1)^2}
I know that it can be done, however I cannot determine how.
How might I go about applying this to 3-dimensions, or some nonlinear system in general?
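One common way to apply gradient descent to a system of equations is to move every term to one side, so that each equation becomes a residual r_i(p) that is zero at a solution, and then minimize the sum of squared residuals. The sketch below does that with a finite-difference gradient and a backtracking step size. It is an illustration under those assumptions rather than a tuned solver, and solve_by_descent, Residuals and the constants are names and values chosen for this example. For the TDOA system above, each of the three equations would supply one residual (left side minus right side).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Residuals = std::function<Vec(const Vec&)>;

// Objective: sum of squared residuals, which is zero exactly at a solution.
double sum_sq(const Residuals& r, const Vec& p) {
    double s = 0.0;
    for (double ri : r(p)) s += ri * ri;
    return s;
}

// Gradient descent on sum_sq with a central-difference gradient and a
// backtracking (Armijo) step size. Returns the last iterate.
Vec solve_by_descent(const Residuals& r, Vec p, int max_iters = 10000,
                     double grad_tol = 1e-9) {
    assert(!p.empty());
    const double h = 1e-6;  // finite-difference spacing (assumed value)
    for (int it = 0; it < max_iters; ++it) {
        Vec g(p.size());
        double g2 = 0.0;  // squared gradient norm
        for (std::size_t k = 0; k < p.size(); ++k) {
            Vec hi = p, lo = p;
            hi[k] += h;
            lo[k] -= h;
            g[k] = (sum_sq(r, hi) - sum_sq(r, lo)) / (2.0 * h);
            g2 += g[k] * g[k];
        }
        if (std::sqrt(g2) < grad_tol) break;  // gradient is numerically zero

        const double f0 = sum_sq(r, p);
        Vec trial = p;
        double t = 1.0;
        bool accepted = false;
        while (t > 1e-14) {
            for (std::size_t k = 0; k < p.size(); ++k) trial[k] = p[k] - t * g[k];
            if (sum_sq(r, trial) <= f0 - 0.3 * t * g2) {
                accepted = true;
                break;
            }
            t *= 0.5;  // step too long: halve it and retry
        }
        if (!accepted) break;  // no step decreases the objective any further
        p = trial;
    }
    return p;
}
```

For instance, with the residuals x^2 + y^2 - 4 and x - y (a made-up test system, not the TDOA equations) and the start point (1, 0.5), the iteration should settle near (1.41421, 1.41421). Like any descent method it can stall in a local minimum of the squared-residual objective, so the starting guess matters.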

Calculation sine and cosine in one shot

I have a scientific code that uses both sine and cosine of the same argument (I basically need the complex exponential of that argument). I was wondering if it were possible to do this faster than calling sine and cosine functions separately.
Also I only need about 0.1% precision. So is there any way I can find the default trig functions and truncate the power series for speed?
One other thing I have in mind is, is there any way to perform the remainder operation such that the result is always positive? In my own algorithm I used x=fmod(x,2*pi); but then I would need to add 2pi if x is negative (smaller domain means I can use a shorter power series)
EDIT: LUT turned out to be the best approach for this, however I am glad I learned about other approximation techniques. I will also advise using an explicit midpoint approximation. This is what I ended up doing:
const int N = 10000;//about 3e-4 error for 1000//3e-5 for 10 000//3e-6 for 100 000
double *cs = new double[N];
double *sn = new double[N];
for(int i =0;i<N;i++){
    double A= (i+0.5)*2*pi/N;
    cs[i]=cos(A);
    sn[i]=sin(A);
}
The following part approximates (midpoint) sincos(2*pi*(wc2+t[j]*(cotp*t[j]-wc)))
double A=(wc2+t[j]*(cotp*t[j]-wc));
int B =(int)(N*(A-floor(A)));//cast the whole product, not just N
re += cs[B]*f[j];
im += sn[B]*f[j];
Another approach could have been using the chebyshev decomposition. You can use the orthogonality property to find the coefficients. Optimized for exponential, it looks like this:
double fastsin(double x){
    x=x-floor(x/2/pi)*2*pi-pi;//this line can be improved, both inside this
    //function and before you input it into the function
    double x2 = x*x;
    return (((0.00015025063885163012*x2-
        0.008034350857376128)*x2+ 0.1659789684145034)*x2-0.9995812174943602)*x;
} //7th order Chebyshev approx
If you seek fast evaluation with good (but not high) accuracy with powerseries you should use an expansion in Chebyshev polynomials: tabulate the coefficients (you'll need VERY few for 0.1% accuracy) and evaluate the expansion with the recursion relations for these polynomials (it's really very easy).
References:
Tabulated coefficients: http://www.ams.org/mcom/1980-34-149/S0025-5718-1980-0551302-5/S0025-5718-1980-0551302-5.pdf
Evaluation of chebyshev expansion: https://en.wikipedia.org/wiki/Chebyshev_polynomials
You'll need to (a) get the "reduced" argument in the range -pi/2..+pi/2 and consequently then (b) handle the sign in your results when the argument actually should have been in the "other" half of the full elementary interval -pi..+pi. These aspects should not pose a major problem:
1. Determine (and "remember" as an integer 1 or -1) the sign of the original angle and proceed with the absolute value.
2. Use a modulo function to reduce to the interval 0..2PI.
3. Determine (and "remember" as an integer 1 or -1) whether it is in the "second" half and, if so, subtract pi*3/2, otherwise subtract pi/2. Note: this effectively interchanges sine and cosine (apart from signs); take this into account in the final evaluation.
This completes the step to get an angle in -pi/2..+pi/2.
After evaluating sine and cosine with the Chebyshev expansions, apply the "flags" of steps 1 and 3 above to get the right signs in the values.
Just create a lookup table. The following will let you lookup the sin and cos of any radian value between -2PI and 2PI.
// LOOK UP TABLE
var LUT_SIN_COS = [];
var N = 14400;
var HALF_N = N >> 1;
var STEP = 4 * Math.PI / N;
var INV_STEP = 1 / STEP;
// BUILD LUT
for(var i=0, r = -2*Math.PI; i < N; i++, r += STEP) {
    LUT_SIN_COS[2*i] = Math.sin(r);
    LUT_SIN_COS[2*i + 1] = Math.cos(r);
}
You index into the lookup table by:
var index = ((r * INV_STEP) + HALF_N) << 1;
var sin = LUT_SIN_COS[index];
var cos = LUT_SIN_COS[index + 1];
Here's a fiddle that displays the % error you can expect from different sized LUTS http://jsfiddle.net/77h6tvhj/
EDIT Here's an ideone (c++) with a ~benchmark~ vs the float sin and cos. http://ideone.com/SGrFVG For whatever a benchmark on ideone.com is worth the LUT is 5 times faster.
One way to go would be to learn how to implement the CORDIC algorithm. It is not difficult and is pretty interesting intellectually. This gives you both the cosine and the sine. Wikipedia gives a MATLAB example that should be easy to adapt to C++.
Note that you can increase speed and reduce precision simply by lowering the parameter n.
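A rough sketch of CORDIC in rotation mode, written from the standard description of the algorithm rather than adapted from the Wikipedia MATLAB listing. It assumes the angle has already been reduced to about [-pi/2, pi/2]. A real implementation would precompute the arctangent table and the gain constant; this version calls std::atan and std::sqrt on every call only to stay short. cordic_cos_sin is a name chosen here.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Returns {cos(theta), sin(theta)} using n CORDIC iterations.
// theta must lie within the convergence range of the iteration (about +-1.74).
std::pair<double, double> cordic_cos_sin(double theta, int n = 24) {
    assert(n > 0 && std::fabs(theta) <= 1.74);

    // Each micro-rotation stretches the vector by sqrt(1 + 2^(-2i)); start
    // with the inverse of the accumulated gain so the result has unit length.
    double gain = 1.0;
    for (int i = 0; i < n; ++i) gain /= std::sqrt(1.0 + std::ldexp(1.0, -2 * i));

    double x = gain, y = 0.0, z = theta;  // z is the angle still to rotate
    for (int i = 0; i < n; ++i) {
        const double p = std::ldexp(1.0, -i);      // 2^-i
        const double d = (z >= 0.0) ? 1.0 : -1.0;  // rotation direction
        const double xn = x - d * y * p;
        y += d * x * p;
        x = xn;
        z -= d * std::atan(p);
    }
    return {x, y};
}
```

Each extra iteration adds roughly one bit of accuracy, so for the 0.1% precision asked for in the question a much smaller n than the default used here would probably do.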
About your second question, it has already been asked here (in C). It seems that there is no simple way.
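For what it's worth, a small wrapper around std::fmod is usually enough to get a non-negative remainder. A sketch, with positive_fmod as a made-up name:

```cpp
#include <cassert>
#include <cmath>

// Remainder of x / m shifted into [0, m). std::fmod keeps the sign of x, so
// a negative result is moved up by one period.
// Caveat: for a tiny negative remainder, r + m can round to exactly m.
double positive_fmod(double x, double m) {
    assert(m > 0.0);
    double r = std::fmod(x, m);
    if (r < 0.0) r += m;
    return r;
}
```

The expression x - floor(x/m)*m, as used in the fastsin function above, is another way to get the same effect.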
You can also calculate sine using a square root, given the angle and the cosine.
The example below assumes the angle ranges from 0 to 2π:
double c = cos(angle);
double s = sqrt(1.0-c*c);
if(angle>pi)s=-s;
For single-precision floats, Microsoft uses an 11-degree polynomial approximation for sine and a 10-degree one for cosine: XMScalarSinCos.
They also have a faster version, XMScalarSinCosEst, that uses lower-degree polynomials.
If you aren’t on Windows, you’ll find the same code and coefficients on geometrictools.com under the Boost license.

Program Help - Solving for e(n)

I've been wrestling with this issue for a week and I just need some guidance on the math part of it. If I could just understand the math behind it I could piece together the functions to make it work. The assignment is;
Design and develop a C++ program for Calculating e(n) when delta <= 0.000001
e(n-1) = 1 + 1/1! + 1/2! + 1/3! + 1/4! + … + 1/(n-1)!
e(n) = 1 + 1/1! + 1/2! + 1/3! + 1/4! + … + 1/(n)!
delta = e(n) – e(n-1)
You do not have any input to the program. Your output should be something like this:
N = 2 e(1) = 2 e(2) = 2.5 delta = 0.5
N = 3 e(2) = 2.5 e(3) = 2.565 delta = 0.065
...
You must use recursive function calls.
My first issue is the math and the variables that would contain them.
The delta, e(n), and e(n-1) variables must be doubles.
if e(n) = 1 + 1 / 1! = 2 then e(n-1) must equal 1, which means delta = 1 (that's my thinking anyway) I'm just not sure of the math behind the .5 delta the first time and the 0.065 in the second iteration.
Can someone point me in the right direction on this problem?
Thank you,
T
From the Wikipedia link, you can see that e is the sum of the infinite series
e = 1 + 1/1! + 1/2! + 1/3! + 1/4! + …
I will not explain the notion of limits here, but what this basically means is that, if we define a function e where e(n) = 1 + 1/1! + 1/2! + 1/3! + 1/4! + … + 1/(n)! (which is the function given in your problem), we are able to approximate the real value of the constant e.
The higher n is, the closer we get to e.
If you look closely at the function, you can see that each time, we add a term which is no larger than the previous one: 1 >= 1/1! >= 1/2! >= .... >= 1/(n)!
That basically means that, every time we increase n, we get closer to e, but we slow down along the way.
The real value of e is 2.71828...
At the start, e(0) = 1, and we are 1.71828... away from the real value
In the next step, e(1) = 2, and we are at 0.71828..., 1 distance closer
In the step after that, e(2) = 2.5, and we are now at 0.21828..., 0.5 distance closer
As you can see, we are getting there, but the closer we get, the slower we move. Now let's say that at each step, we want to know how far we have moved compared to the previous value.
We then simply compute e(n) - e(n-1). This is basically what the delta means.
At some point, we are moving so slowly that it no longer makes any sense to keep going. We are almost staying put. At this point, we decide that our approximation is close enough to e.
In your case, the problem sets the minimum progression speed to 0.000001
Here is a solution:
delta = e(n) - e(n-1)
delta = 1/n!
delta <= 0.000001
n! >= 1000000
n >= 10, as 9! = 362880 and 10! = 3628800
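The recursion the assignment asks for can be sketched as follows. This is one possible arrangement, not the required one: factorial, e_approx and first_n_within are names invented here, and starting the search at n = 1 is an assumption.

```cpp
#include <cassert>

// n! computed recursively, returned as a double.
double factorial(int n) {
    assert(n >= 0);
    return n <= 1 ? 1.0 : n * factorial(n - 1);
}

// e(n) = 1 + 1/1! + 1/2! + ... + 1/n!, computed recursively.
double e_approx(int n) {
    assert(n >= 0);
    return n == 0 ? 1.0 : e_approx(n - 1) + 1.0 / factorial(n);
}

// Recursively finds the first n (starting from the given n) for which
// delta = e(n) - e(n-1) is at most tol.
int first_n_within(double tol, int n = 1) {
    const double delta = e_approx(n) - e_approx(n - 1);
    return delta <= tol ? n : first_n_within(tol, n + 1);
}
```

Under these definitions first_n_within(0.000001) stops at n = 10, which agrees with the 10! = 3628800 argument above. Recomputing e_approx from scratch at every step is wasteful but keeps the recursion easy to follow.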

Quadrature routines for probability densities

I want to integrate a probability density function from (-\infty, a] because the cdf is not available in closed form. But I'm not sure how to do this in C++.
This task is pretty simple in Mathematica; All I need to do is define the function,
f[x_, lambda_, alpha_, beta_, mu_] :=
Module[{gamma},
gamma = Sqrt[alpha^2 - beta^2];
(gamma^(2*lambda)/((2*alpha)^(lambda - 1/2)*Sqrt[Pi]*Gamma[lambda]))*
Abs[x - mu]^(lambda - 1/2)*
BesselK[lambda - 1/2, alpha Abs[x - mu]] E^(beta (x - mu))
];
and then call the NIntegrate Routine to numerically integrate it.
F[x_, lambda_, alpha_, beta_, mu_] :=
NIntegrate[f[t, lambda, alpha, beta, mu], {t, -\[Infinity], x}]
Now I want to achieve the same thing in C++. I am using the routine gsl_integration_qagil from the GSL numerics library. It is designed to integrate functions on the semi-infinite intervals (-\infty, a], which is just what I want. But unfortunately I can't get it to work.
This is the density function in C++,
double density(double x)
{
    using namespace boost::math;
    if(x == _mu)
        return std::numeric_limits<double>::infinity();
    return pow(_gamma, 2*_lambda)/(pow(2*_alpha, _lambda-0.5)*sqrt(_pi)*tgamma(_lambda))* pow(abs(x-_mu), _lambda - 0.5) * cyl_bessel_k(_lambda-0.5, _alpha*abs(x - _mu)) * exp(_beta*(x - _mu));
}
Then I try and integrate to obtain the cdf by calling the gsl routine.
double cdf(double x)
{
    gsl_integration_workspace * w = gsl_integration_workspace_alloc (1000);
    double result, error;
    gsl_function F;
    F.function = &density;
    double epsabs = 0;
    double epsrel = 1e-12;
    gsl_integration_qagil (&F, x, epsabs, epsrel, 1000, w, &result, &error);
    printf("result = % .18f\n", result);
    printf ("estimated error = % .18f\n", error);
    printf ("intervals = %zu\n", w->size);
    gsl_integration_workspace_free (w);
    return result;
}
However gsl_integration_qagil returns an error, number of iterations was insufficient.
double mu = 0.0f;
double lambda = 3.0f;
double alpha = 265.0f;
double beta = -5.0f;
cout << cdf(0.01) << endl;
If I increase the size of the workspace then the bessel function will not evaluate.
I was wondering if there was anyone that could give me any insight to my problem. A call to the corresponding Mathematica function F above with x = 0.01 returns 0.904384.
Could it be that the density is concentrated around a very small interval (i.e. outside of [-0.05, 0.05] the density is almost 0; a plot is given below)? If so, what can be done about this? Thanks for reading.
Re: integrating to +/- infinity:
I would use Mathematica to find an empirical bound for |x - μ| >> K, where K represents the "width" around the mean, and K is a function of alpha, beta, and lambda. For example, the density f is less than and approximately equal to a(x-μ)^-2 or a·e^(-b(x-μ)^2) or whatever. These functions have known integrals out to infinity, which you can evaluate in closed form. Then you can integrate numerically out to K, and use the bounded approximation to get from K to infinity.
Figuring out K may be a bit tricky; I'm not very familiar with Bessel functions so I can't help you much there.
In general, I've found that for numerical calculation that's not obvious, the best way is to do as much analytical math as you can before you do numerical evaluation. (Kind of like an autofocus camera -- get it close to where you want, then let the camera do the rest.)
I haven't tried the C++ code, but by checking out the function in Mathematica, it does seem extremely peaked around mu, with the spread of the peak determined by the parameters lambda,alpha,beta.
What I would do would be to do a preliminary search of the pdf: look to the right and to the left of x=mu until you find the first value below a given tolerance. Use these as the bounds for your cdf, instead of negative infinity.
Pseudo code follows:
xtest = mu
step = 0.000001
adaptive_step(y_value) -> returns a small step size if close to 0, and larger if far.
pdf_current = pdf(xtest)
while (pdf_current > tolerance):
    step = adaptive_step(pdf_current)
    xtest = xtest - step
    pdf_current = pdf(xtest)
left_bound = xtest
//repeat for right bound, stepping with xtest = xtest + step
Given how tightly peaked this function seems to be, tightening the bounds would probably save you a lot of computer time that's currently wasted calculating zeros. Also, you'd be able to use the bounded integration routine, rather than -\infty,b .
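A compact C++ sketch of the same idea: walk outward from the peak until the density falls below a tolerance, then integrate over the resulting finite interval. The step-doubling rule, the fixed-size Simpson rule, and all the names here are assumptions made for illustration, and a standard normal density stands in for the question's density so that the example is self-contained. With GSL available, a finite-interval routine such as gsl_integration_qags would be the natural replacement for simpson.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

using Density = std::function<double(double)>;

// Stand-in density for the example (standard normal), not the one in the
// question.
double normal_pdf(double x) {
    return std::exp(-0.5 * x * x) / std::sqrt(2.0 * 3.14159265358979323846);
}

// Walks away from `mode` in direction dir (+1 or -1), doubling the step each
// time, until the density is no larger than `tolerance`.
double find_bound(const Density& pdf, double mode, int dir, double tolerance,
                  double first_step = 1e-6) {
    assert(tolerance > 0.0 && (dir == 1 || dir == -1));
    double step = first_step;
    double x = mode + dir * step;
    while (pdf(x) > tolerance) {
        step *= 2.0;
        x += dir * step;
    }
    return x;
}

// Composite Simpson's rule on [a, b] with n (even) subintervals.
double simpson(const Density& pdf, double a, double b, int n = 2000) {
    assert(n > 0 && n % 2 == 0);
    const double h = (b - a) / n;
    double sum = pdf(a) + pdf(b);
    for (int i = 1; i < n; ++i) sum += pdf(a + i * h) * (i % 2 ? 4.0 : 2.0);
    return sum * h / 3.0;
}

// Approximate cdf(x): integrate from the left bound instead of -infinity.
double cdf_truncated(const Density& pdf, double mode, double x,
                     double tolerance = 1e-12) {
    const double left = find_bound(pdf, mode, -1, tolerance);
    return x <= left ? 0.0 : simpson(pdf, left, x);
}
```

A density with a singular or very sharp peak, like the one in the question, would likely need an adaptive rule rather than a fixed number of Simpson panels.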
Just a thought...
PS: Mathematica gives me F[0.01, 3, 265, -5, 0] = 0.884505
I found a complete description of these GSL routines at http://linux.math.tifr.res.in/manuals/html/gsl-ref-html/gsl-ref_16.html; you may find useful information there.
Since I'm not a GSL expert, I did not look at your problem from the math point of view; instead I want to remind you of some key aspects of floating-point programming.
You can't represent most real numbers exactly using the IEEE 754 standard. Mathematica can hide this by using arbitrary-precision arithmetic to limit rounding error in the results, which is one reason it is slow compared to native code.
I strongly recommend this link for anyone doing scientific computing with an FPU:
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Assuming you've enjoyed that article, I noticed this on the GSL page linked above: "The routines will fail to converge if the error bounds are too stringent".
Your bounds may be too stringent if the difference between the upper and the lower is less than what a double can resolve relative to their magnitude, that is
std::numeric_limits<double>::epsilon() (the gap between 1.0 and the next representable double).
In addition, remember from the second link that every floating-point operation rounds its result (by default to the nearest representable value), and that converting a floating-point value to an integer in C/C++ truncates; this introduces subtle calculation errors that can lead to wrong results. I had this problem with a simple Liang-Barsky line clipper, which is only first order! So imagine the mess in this line:
return pow(_gamma, 2*_lambda)/(pow(2*_alpha, _lambda-0.5)*sqrt(_pi)*tgamma(_lambda))* pow(abs(x-_mu), _lambda - 0.5) * cyl_bessel_k(_lambda-0.5, _alpha*abs(x - _mu)) * exp(_beta*(x - _mu));
As a general rule, it is wise in C/C++ to add variables holding intermediate results, so you can debug step by step and see any rounding error; you shouldn't try to write an expression like this one as a single statement in any native programming language. You can't optimize variables better than the compiler can.
Finally, as a general rule, you should multiply everything then divide, unless you are confident about the dynamic behavior of your calculus.
Good luck.

2D Discrete laplacian (del2) in C++

I am trying to figure out how to port the del2() function in matlab to C++.
I have a couple of masks that I am working with that are ones and zeros, so I wrote code like this:
for(size_t i = 1 ; i < nmax-1 ; i++)
{
    for(size_t j = 1 ; j < nmax-1 ; j++)
    {
        transmask[i*nmax+j] = .25*(posmask[(i+1)*nmax + j]+posmask[(i-1)*nmax+j]+posmask[i*nmax+(j+1)]+posmask[i*nmax+(j-1)]);
    }
}
to compute the interior points of the Laplacian. I think, according to some info in "doc del2" in MATLAB, the border conditions just use the available info to compute, right? So I guess I just need to write cases for the border conditions at i,j = 0 and nmax.
However, I would think these values from the code I have posted here would be correct for the interior points as is, but it seems like the del2 results are different!
I dug through the del2 source, and I guess I am not enough of a MATLAB wizard to figure out what is going on with some of the code for the interior computation.
You can see the code of del2 by edit del2 or type del2.
Note that del2 does cubic interpolation on the boundaries.
The problem is that the line you have there:
transmask[i*nmax+j] = .25*(posmask[(i+1)*nmax + j]+posmask[(i-1)*nmax+j]+posmask[i*nmax+(j+1)]+posmask[i*nmax+(j-1)]);
isn't the discrete Laplacian at all.
What you have is (I(i+1,j) + I(i-1,j) + I(i,j+1) + I(i,j-1) ) / 4
I don't know what this mask is, but the discrete Laplacian (assuming the spacing between each pixel in each dimension is 1) is:
(-4 * I(i,j) + I(i+1,j) + I(i-1,j) + I(i,j+1) + I(i,j-1) )
So basically, you missed a term, and you don't need to divide by 4. I suggest going back and rederiving the discrete Laplacian from its definition, which is the second x derivative of the image plus the second y derivative of the image.
Edit: I see where you got the /4 from, as Matlab uses this definition for some reason (even though this isn't standard mathematically).
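Putting the two formulas together: for an interior point with unit spacing, a del2-style value is the average of the four neighbours minus the centre value, which is the same as the discrete Laplacian divided by 4. The sketch below assumes a row-major rows x cols array and leaves the border entries at zero, since the boundary extrapolation that del2 performs is not reproduced here. del2_interior is a name chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interior points of a del2-style result (discrete Laplacian / 4) for a
// row-major rows x cols grid with unit spacing. Border entries stay 0.
std::vector<double> del2_interior(const std::vector<double>& u,
                                  std::size_t rows, std::size_t cols) {
    assert(u.size() == rows * cols);
    std::vector<double> out(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < rows; ++i) {
        for (std::size_t j = 1; j + 1 < cols; ++j) {
            const double neighbours = u[(i + 1) * cols + j] + u[(i - 1) * cols + j] +
                                      u[i * cols + (j + 1)] + u[i * cols + (j - 1)];
            // Missing term from the question's code: subtract the centre.
            out[i * cols + j] = 0.25 * (neighbours - 4.0 * u[i * cols + j]);
        }
    }
    return out;
}
```

As a sanity check, for u(i,j) = i*i + j*j the true Laplacian is 4 everywhere, so every interior entry of the result is 1.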
I think that with the Matlab compiler you can convert the m code into C code. Have you tried that?
I found this link where another method to convert to C is explained.
http://www.kluid.com/mlib/viewtopic.php?t=337
Good luck.