can anyone look over some simple gradient descent code? - c++

I'm trying to implement a very simple 1-dimensional gradient descent algorithm. The code I have does not work at all. Basically depending on my alpha value, the end parameters will either be wildly huge (like ~70 digits), or basically zero (~ 0.000). I feel like a gradient descent should not be nearly this sensitive in alpha (I'm generating small data in [0.0,1.0], but I think the gradient itself should account for the scale of the data, no?).
Here's the code:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>
using namespace std;

double a, b;
double theta0 = 0.0, theta1 = 0.0;

double myrand() {
    return double(rand()) / RAND_MAX;
}

double f(double x) {
    double y = a * x + b;
    y *= 0.1 * (myrand() - 0.5); // +/- 5% noise
    return y;
}

double h(double x) {
    return theta1 * x + theta0;
}

int main() {
    a = myrand();
    b = myrand();
    printf("set parameters: a = %lf, b = %lf\n", a, b);
    int N = 100;
    vector<double> xs(N);
    vector<double> ys(N);
    for (int i = 0; i < N; ++i) {
        xs[i] = myrand();
        ys[i] = f(xs[i]);
    }
    double sensitivity = 0.008;
    double d0, d1;
    for (int n = 0; n < 100; ++n) {
        d0 = d1 = 0.0;
        for (int i = 0; i < N; ++i) {
            d0 += h(xs[i]) - ys[i];
            d1 += (h(xs[i]) - ys[i]) * xs[i];
        }
        theta0 -= sensitivity * d0;
        theta1 -= sensitivity * d1;
    }
    printf("theta0: %lf, theta1: %lf\n", theta0, theta1);
    return 0;
}

Changing the value of alpha can cause the algorithm to diverge, so that may be one cause of what you are seeing. You can check by computing the error at each iteration and seeing whether it increases or decreases.
In addition, it is recommended to initialize the values of theta randomly instead of setting them to zero.
Apart from that, you should divide by N when you update the values of theta, as follows:
theta0 -= sensitivity * d0/N;
theta1 -= sensitivity * d1/N;
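To make the two suggestions above concrete (average the gradient over N, and watch the error at each iteration), here is a minimal sketch. The names cost, step and descend are made up for this illustration and are not part of the question's code; the cost is assumed to be the usual mean squared error.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean squared error cost: J = 1/(2N) * sum (theta1*x + theta0 - y)^2.
double cost(const std::vector<double>& xs, const std::vector<double>& ys,
            double theta0, double theta1) {
    assert(xs.size() == ys.size() && !xs.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        const double r = theta1 * xs[i] + theta0 - ys[i];
        sum += r * r;
    }
    return sum / (2.0 * xs.size());
}

// One batch update with the gradient averaged over N (hypothetical helper).
void step(const std::vector<double>& xs, const std::vector<double>& ys,
          double alpha, double& theta0, double& theta1) {
    double d0 = 0.0, d1 = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        const double r = theta1 * xs[i] + theta0 - ys[i];
        d0 += r;
        d1 += r * xs[i];
    }
    const double n = static_cast<double>(xs.size());
    theta0 -= alpha * d0 / n;
    theta1 -= alpha * d1 / n;
}

// Runs up to `iterations` updates. Returns false as soon as the cost goes up,
// which is a sign that alpha is too large for this data.
bool descend(const std::vector<double>& xs, const std::vector<double>& ys,
             double alpha, int iterations, double& theta0, double& theta1) {
    double previous = cost(xs, ys, theta0, theta1);
    for (int n = 0; n < iterations; ++n) {
        step(xs, ys, alpha, theta0, theta1);
        const double current = cost(xs, ys, theta0, theta1);
        if (current > previous) return false;
        previous = current;
    }
    return true;
}
```

Calling descend with a small alpha and then with a much larger one on the same data shows the behaviour described in the question: the first call settles near the true parameters, while the second reports a rising cost almost immediately.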

I had a quick look at your implementation and it looks fine to me.
The code I have does not work at all.
I wouldn't say that. It seems to behave correctly for small enough values of sensitivity, which is a value that you just have to "guess", and that is how the gradient descent is supposed to work.
I feel like a gradient descent should not be nearly this sensitive in alpha
If you struggle to visualize that, remember that you are using gradient descent to find the minimum of the cost function of linear regression, which is a quadratic function. If you plot the cost function you will see why the learning rate is so sensitive in these cases: intuitively, if the parabola is narrow, the algorithm will converge more quickly, which is good, but then the learning rate is more "sensitive" and the algorithm can easily diverge if you are not careful.


Ineffective "Peel/Remainder" Loop in my code

I have this function:
bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{
    bool ret = false;
    // input size (-1 for the safe bilinear interpolation)
    const int width = im.cols-1;
    const int height = im.rows-1;
    // output size
    const int halfWidth = res.cols >> 1;
    const int halfHeight = res.rows >> 1;
    float *out = res.ptr<float>(0);
    const float *imptr = im.ptr<float>(0);
    for (int j=-halfHeight; j<=halfHeight; ++j)
    {
        const float rx = ofsx + j * a12;
        const float ry = ofsy + j * a22;
        #pragma omp simd
        for(int i=-halfWidth; i<=halfWidth; ++i, out++)
        {
            float wx = rx + i * a11;
            float wy = ry + i * a21;
            const int x = (int) floor(wx);
            const int y = (int) floor(wy);
            if (x >= 0 && y >= 0 && x < width && y < height)
            {
                // compute weights
                wx -= x; wy -= y;
                int rowOffset = y*im.cols;
                int rowOffset1 = (y+1)*im.cols;
                // bilinear interpolation
                *out =
                    (1.0f - wy) * ((1.0f - wx) * imptr[rowOffset+x] + wx * imptr[rowOffset+x+1]) +
                    ( wy) * ((1.0f - wx) * imptr[rowOffset1+x] + wx * imptr[rowOffset1+x+1]);
            } else {
                *out = 0;
                ret = true; // touching boundary of the input
            }
        }
    }
    return ret;
}
halfWidth varies widely: it can be 9, 84, 20, 95, 111... I'm only trying to optimize this code; I don't understand it in detail.
As you can see, the inner for has been already vectorized, but Intel Advisor suggests this:
And this is the Trip Count analysis result:
To my understanding this means that:
Vector length is 8, so 8 floats can be processed at the same time in each vectorized iteration. This would mean (if I'm not wrong) that the data is 32-byte aligned (even though, as I explain here, it seems that the compiler thinks that the data is not aligned).
On average, 2 iterations are fully vectorized, while 3 iterations run in the remainder loop. The same goes for Min and Max. Otherwise I don't understand what ; means.
Now my question is: how can I follow Intel Advisor's first suggestion? It says to "increase the size of objects and add iterations so the trip count is a multiple of vector length". OK, so it's simply saying "make halfWidth*2+1 (since the loop runs from -halfWidth to +halfWidth) a multiple of 8". But how can I do this? If I add arbitrary iterations, this would obviously break the algorithm!
The only solution that came to my mind is to add "fake" iterations like this:
const int vectorLength = 8;
const int iterations = halfWidth*2+1;
const int remainder = iterations%vectorLength;
for(int i=0; i<loop+length-remainder; i++){
//this iteration was not supposed to exist, skip it!
Of course this code would not work since it goes from -halfWidth to halfWidth, but it's to make you understand my strategy of "fake" iterations.
About the second option ("Increase the size of static and automatic objects, and use a compiler option to add data padding") I have no idea how to implement this.
First, check the Vector Advisor efficiency metric, as well as the relative time spent in the loop remainder compared with the loop body (see the hotspots list in Advisor). If efficiency is close to 100% (or the time spent in the remainder is very small), then it is not worth the effort (or the money, as MSalters mentioned in the comments).
If it is well below 100% (and the tool reports no other penalties), then you can either refactor the code to "add fake iterations" (few users can afford to do so) or try #pragma loop_count with the most typical iteration counts (depending on the typical halfWidth values).
If halfWidth is totally random (no common or average values), then there is not much you can do about this issue.
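As a rough illustration of the "fake iterations" idea, the sketch below rounds the trip count up to a multiple of the vector length and lets the extra iterations write into padding that nobody reads. It assumes the output row can be over-allocated, which is not true of the question's Mat as written; roundUpToMultiple and fillPaddedRow are hypothetical names, and the loop body is a stand-in for the real interpolation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `count` up to the next multiple of `vectorLength`.
int roundUpToMultiple(int count, int vectorLength) {
    assert(count >= 0 && vectorLength > 0);
    return (count + vectorLength - 1) / vectorLength * vectorLength;
}

// Hypothetical stand-in for the inner loop: computes rx + i * a11 for
// i in [-halfWidth, halfWidth], but iterates over a padded trip count.
// Only the first 2 * halfWidth + 1 entries of the result are meaningful.
std::vector<float> fillPaddedRow(int halfWidth, float rx, float a11,
                                 int vectorLength = 8) {
    const int realCount = 2 * halfWidth + 1;
    const int paddedCount = roundUpToMultiple(realCount, vectorLength);
    std::vector<float> row(static_cast<std::size_t>(paddedCount));
    for (int k = 0; k < paddedCount; ++k) {
        const int i = k - halfWidth;   // same index range as before, plus padding
        row[static_cast<std::size_t>(k)] = rx + i * a11;
    }
    return row;
}
```

For halfWidth = 9 the real trip count is 19 and the padded one is 24, so the loop runs three full vectors of 8 with no remainder. Whether the wasted work is cheaper than the remainder loop has to be measured.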

simulated annealing algorithm

I implemented simulated annealing in C++ to minimize (x-2)^2+(y-1)^2 in some range.
I'm getting varied output which is not acceptable for this type of heuristic method. It seems that the solution is converging but never quite closing in on the solution.
My code:
#include <bits/stdc++.h>
using namespace std;
double func(double x, double y)
return (pow(x-2, 2)+pow(y-1, 2));
double accept(double z, double minim, double T,double d)
double p = -(z - minim) / (d * T);
return pow(exp(1), p);
double fRand(double fMin, double fMax)
double f = (double)rand() / RAND_MAX;
return fMin + f * (fMax - fMin);
int main()
srand (time(NULL));
double x = fRand(-30,30);
double y = fRand(-30,30);
double xm = x, ym=y;
double tI = 100000;
double tF = 0.000001;
double a = 0.99;
double d=(1.6*(pow(10,-23)));
double T = tI;
double minim = func(x, y);
double z;
double counter=0;
while (T>tF) {
int i=1;
while(i<=30) {
if (z<minim || (accept(z,minim,T,d)>(fRand(0,1)))) {
cout<<"min: "<<minim<<" x: "<<xm<<" y: "<<ym<<endl;
return 0;
How can I get it to reach the solution?
There are a couple of things that I think are wrong in your implementation of the simulated annealing algorithm.
At every iteration you should look at some neighbour z of the current minimum and update it if f(z) < minimum. If f(z) > minimum you can still accept the new point, but only with a probability given by the acceptance function.
The problem is that in your accept function, the parameter d is way too low: for any worse point the function will always return 0.0, so the acceptance condition is never triggered. Try something like 1e-5; it doesn't have to be physically correct, the acceptance probability only has to decrease as the "temperature" is lowered.
After updating the temperature in the outer loop, you should set x=xm and y=ym before running the inner loop; otherwise, instead of searching the neighbours of the current solution, you will basically wander around at random (you aren't checking any boundaries either).
Doing so, I usually get some output like this:
min: 8.25518e-05 x: 2.0082 y: 0.996092
Hope it helped.
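For reference, a minimal sketch of the loop structure described above could look like the following. It is not the asker's code: the step size, the scaling constant k, the fixed seed and the names Result, objective and anneal are all assumptions made for this illustration, and no bounds are enforced on x and y.

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct Result { double x, y, value; };

double objective(double x, double y) {
    return (x - 2.0) * (x - 2.0) + (y - 1.0) * (y - 1.0);
}

// Simulated annealing sketch. `k` only scales the acceptance probability
// (a tuning constant, not a physical one); `seed` makes a run repeatable.
Result anneal(double x0, double y0, unsigned seed, double k = 1.0) {
    assert(k > 0.0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uniform_real_distribution<double> stepDist(-0.5, 0.5);

    Result best{x0, y0, objective(x0, y0)};
    for (double T = 100.0; T > 1e-6; T *= 0.99) {
        // Restart every temperature level from the best point found so far.
        double x = best.x, y = best.y, current = best.value;
        for (int i = 0; i < 30; ++i) {
            const double nx = x + stepDist(gen);   // random neighbour
            const double ny = y + stepDist(gen);
            const double z = objective(nx, ny);
            // Always accept improvements; accept worse points with
            // probability exp(-(z - current) / (k * T)).
            if (z < current || std::exp(-(z - current) / (k * T)) > unit(gen)) {
                x = nx; y = ny; current = z;
            }
            if (current < best.value) best = Result{x, y, current};
        }
    }
    return best;
}
```

Calling anneal(-20.0, 15.0, 42u) returns the best point seen during the run, which should land close to (2, 1); the exact figures depend on the seed and the standard library.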

Why are cubic Bezier functions not accurate compared to the Windows API PolyBezier?

I have been trying to find a way to draw a curved line (a cubic Bezier curve) using a custom function. However, all the examples found on the internet differ a little from each other and usually produce different results. Why? None of the ones I have tried produces the same result as the Windows API PolyBezier, which is what I need.
This is my current code for drawing cubic bezier lines:
double Factorial(int number)
{
    double factorial = 1;
    if (number > 1)
    {
        for (int count = 1; count <= number; count++) factorial = factorial * count;
    }
    return factorial;
}
double choose(double a, double b)
{
    return Factorial(a) / (Factorial(b) * Factorial(a - b));
}
VOID MyPolyBezier(HDC hdc, PPOINT Pts, int Total)
{
    float x, y;
    MoveToEx(hdc, Pts[0].x, Pts[0].y, 0);
    Total -= 1;
    //for (float t = 0; t <= 1; t += (1./128.))
    for (float t = 0; t <= 1; t += 0.0078125)
    {
        x = 0;
        y = 0;
        for (int I = 0; I <= Total; I++)
        {
            x += Pts[I].x * choose(Total, I) * pow(1 - t, Total - I) * pow(t, I);
            y += Pts[I].y * choose(Total, I) * pow(1 - t, Total - I) * pow(t, I);
        }
        LineTo(hdc, x, y);
    }
}
And here is the code for testing it.
POINT TestPts[4];
//set x, y points for the curved line.
TestPts[0].x = 50;
TestPts[0].y = 200;
TestPts[1].x = 100;
TestPts[1].y = 100;
TestPts[2].x = 150;
TestPts[2].y = 200;
TestPts[3].x = 200;
TestPts[3].y = 200;
//Draw using custom function.
MyPolyBezier(hdc, TestPts, 4);
//Move the curve down some.
TestPts[0].y += 10;
TestPts[1].y += 10;
TestPts[2].y += 10;
TestPts[3].y += 10;
//Draw using windows api.
//PolyDraw(hdc, TestPts, TestType, 4); //PolyDraw gives the same result as PolyBezier.
PolyBezier(hdc, TestPts, 4);
And an attached image of my bad results:
Note: the bottom Bezier line is the Windows (PolyBezier) version.
The final goal: Windows (on the left) vs. the custom function. Hopefully this helps in some way.
So a cubic bezier is a mathematical curve. The cubic bezier is a specific case of a more general curve.
The cubic bezier is defined by 4 control points -- a start and end point, and 2 control points. In general, a bezier has n control points in order.
The line is drawn as a time parameter t goes from 0 to 1.
To find out where a general bezier with n control points is at time t:
For each adjacent pair of control points in your bezier, find the weighted average of them, as controlled by t. So a(1-t) + bt for control point a before b, which gives a at t=0 and b at t=1.
Use these n-1 points to form a bezier with n-1 control points.
Solve the new bezier at time t.
When you are down to a single point, stop. That is your point.
Try writing an algorithm based on the true definition of a bezier, and see where it differs from the Windows curve. This may be less frustrating than taking some approximation and having two sets of errors to reconcile.
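The recursive procedure above is usually known as de Casteljau's algorithm. A minimal sketch of it, written iteratively and independent of the Windows API, might look like this; Point2 and evaluateBezier are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };

// Evaluate a Bezier curve with any number of control points at parameter t
// by repeatedly replacing adjacent pairs with their weighted average.
Point2 evaluateBezier(std::vector<Point2> pts, double t) {
    assert(!pts.empty());
    for (std::size_t remaining = pts.size(); remaining > 1; --remaining) {
        for (std::size_t i = 0; i + 1 < remaining; ++i) {
            pts[i].x = (1.0 - t) * pts[i].x + t * pts[i + 1].x;
            pts[i].y = (1.0 - t) * pts[i].y + t * pts[i + 1].y;
        }
    }
    return pts[0];   // the single point left is the curve position at t
}
```

With the four test points from the question, t = 0 gives (50, 200), t = 1 gives (200, 200) and t = 0.5 gives (125, 162.5). Drawing the curve then amounts to calling this for a sequence of t values and connecting the results, rounding to integer pixel coordinates only at the very end.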

Need help optimizing code (minimum image convention)

I have written some simulation code and am using the "randomly break in GDB" method of debugging. I am finding that 99.9% of my program's time is spent in this routine (it's the minimum image convention):
inline double distanceSqPeriodic(double const * const position1, double const * const position2, double boxWidth) {
double xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
The optimizations I have performed so far (maybe not very significant ones):
Return the square of the distance instead of the square root
Inline it
Const what I can
No standard library bloat
Compiling with every g++ optimization flag I can think of
I am running out of things I can do with this. Maybe I could use floats instead of doubles but I would prefer that be a last resort. And maybe I could somehow use SIMD on this, but I've never done that so I imagine that's a lot of work. Any ideas?
First, you're not using the right algorithm. What if the two points are greater than boxWidth apart? Second, if you have multiple particles, calling a single function that does all of the distance calculations and places the results in an output buffer is going to be significantly more efficient. Inlining helps reduce some of this, but not all. Any of the precalculation -- like dividing the box length by 2 in your algorithm -- is going to be repeated when it doesn't need to be.
Here is some SIMD code to do the calculation. You need to compile with -msse4. Using -O3, on my machine (macbook pro, llvm-gcc-4.2), I get a speed up of about 2x. This does require using 32bit floats instead of double precision arithmetic.
SSE really isn't that complicated, it just looks terrible. e.g. instead of writing a*b, you have to write the clunky _mm_mul_ps(a,b).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>
// you can compile this code with -DDOUBLE to try using doubles vs. floats
// in the unoptimized code. The SSE code uses only floats.
#ifdef DOUBLE
typedef double real;
#else
typedef float real;
#endif
static inline __m128 loadFloat3(const float* value) {
    // Load (x,y,z) into a SSE register, leaving the last entry
    // set to zero.
    __m128 x = _mm_load_ss(&value[0]);
    __m128 y = _mm_load_ss(&value[1]);
    __m128 z = _mm_load_ss(&value[2]);
    __m128 xy = _mm_movelh_ps(x, y);
    return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0));
}
int fdistanceSqPeriodic(float* position1, float* position2, const float boxWidth,
                        float* out, const int n_points) {
    int i;
    __m128 r1, r2, r12, s12, r12_2, s, box, invBox;
    box = _mm_set1_ps(boxWidth);
    invBox = _mm_div_ps(_mm_set1_ps(1.0f), box);
    for (i = 0; i < n_points; i++) {
        r1 = loadFloat3(position1);
        r2 = loadFloat3(position2);
        r12 = _mm_sub_ps(r1, r2);
        // wrap the separation into [-boxWidth/2, boxWidth/2] per component
        s12 = _mm_mul_ps(r12, invBox);
        s12 = _mm_sub_ps(s12, _mm_round_ps(s12, _MM_FROUND_TO_NEAREST_INT));
        r12 = _mm_mul_ps(box, s12);
        r12_2 = _mm_mul_ps(r12, r12);
        // double horizontal add instruction accumulates the sum of
        // all four elements into each of the elements
        // (e.g. s.x = s.y = s.z = s.w = r12_2.x + r12_2.y + r12_2.z + r12_2.w)
        s = _mm_hadd_ps(r12_2, r12_2);
        s = _mm_hadd_ps(s, s);
        _mm_store_ss(out++, s);
        position1 += 3;
        position2 += 3;
    }
    return 1;
}
inline real distanceSqPeriodic(real const * const position1, real const * const position2, real boxWidth) {
real xhw, yhw, zhw, x, y, z;
xhw = boxWidth / 2.0;
yhw = xhw;
zhw = xhw;
x = position2[0] - position1[0];
if (x > xhw)
x -= boxWidth;
else if (x < -xhw)
x += boxWidth;
y = position2[1] - position1[1];
if (y > yhw)
y -= boxWidth;
else if (y < -yhw)
y += boxWidth;
z = position2[2] - position1[2];
if (z > zhw)
z -= boxWidth;
else if (z < -zhw)
z += boxWidth;
return x * x + y * y + z * z;
int main(void) {
    real* position1;
    real* position2;
    real* output;
    int n_runs = 10000000;
    posix_memalign((void**) &position1, 16, n_runs*3*sizeof(real));
    posix_memalign((void**) &position2, 16, n_runs*3*sizeof(real));
    posix_memalign((void**) &output, 16, n_runs*sizeof(real));
    real boxWidth = 1.8;
    real result = 0;
    int i;
    clock_t t;
#ifdef OPT
    printf("Timing optimized SSE implementation\n");
#else
    printf("Timing original implementation\n");
#endif
#ifdef DOUBLE
    printf("Using double precision\n");
#else
    printf("Using single precision\n");
#endif
    t = clock();
#ifdef OPT
    fdistanceSqPeriodic(position1, position2, boxWidth, output, n_runs);
#else
    for (i = 0; i < n_runs; i++) {
        *output = distanceSqPeriodic(position1, position2, boxWidth);
        position1 += 3;
        position2 += 3;
    }
#endif
    t = clock() - t;
    printf("It took me %d clicks (%f seconds).\n", (int) t, ((float)t)/CLOCKS_PER_SEC);
}
You may want to use fabs (standardized in ISO C90), since this should reduce to a single non-branching instruction.
Return the square of the distance instead of the square root
That's a good idea as long as you are comparing squares to squares.
Inline it
This is sometimes a counter-optimization: Inlined code takes up space in the execution pipeline/cache, whether it is branched to or not.
Often it makes no difference because the compiler has the final word on whether to inline or not.
Const what I can
Normally no difference at all.
No standard library bloat
What bloat?
Compiling with every g++ optimization flag I can think of
That's good: Leave most optimizations to the compiler. Only if you measured your real bottleneck, and determined if that bottleneck is significant, invest money on hand optimizing.
What you could try do is to make your code branchfree. Without using bitmasks, this may look like this:
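//if (z > zhw)
//    z -= boxWidths[2];
//else if (z < -zhw)
//    z += boxWidths[2];
const double z_a[] = {
    z,                // z <= zhw: unchanged
    z - boxWidths[2]  // z >  zhw: wrapped
};
z = z_a[z>zhw];
// or, as an alternative to the lookup above:
// z -= (z>zhw) * boxWidths[2];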
However, there is no guarantee that this is faster. Your compiler may now have a harder time identifying SIMD spots in your code, or the branch target buffer does a good job and most of the times you have the same code paths through your function.
You need to get rid of the comparisons, as those are hard to predict.
The function to be implemented is:
/ / /\ /\
/ / / \/ \
----0----- or ------------ , as (-x)^2 == x^2
/ /
/ /
The latter is a result of two abs statements:
x = abs(half-abs(diff))-half;
The code
double tst(double a[4], double b[4], double half)
{
    double sum=0.0,t;
    int i;
    for (i=0;i<3;i++) { t=fabs(fabs(b[i]-a[i])-half)-half; sum+=t*t;}
    return sum;
}
beats the original implementation by a factor of four (+some) -- and at this point there's not even full parallelism: only the lower half of xmm registers are used.
With parallel processing of x && y, there's a theoretical gain of about 50% to be achieved. Using floats instead of doubles could in theory make it still about 3x faster.
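To check that the branch-free forms discussed in this thread agree with the original, here is a small scalar sketch that puts three variants side by side. The function names are invented for this comparison. The fabs variant assumes each coordinate difference is at most boxWidth in magnitude, while the std::round variant also covers points that are further apart, which is the case one of the answers above warns about.

```cpp
#include <cassert>
#include <cmath>

// Branchy reference, following the logic of the original routine.
double distanceSqBranchy(const double* p1, const double* p2, double boxWidth) {
    const double half = boxWidth / 2.0;
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        double d = p2[i] - p1[i];
        if (d > half) d -= boxWidth;
        else if (d < -half) d += boxWidth;
        sum += d * d;
    }
    return sum;
}

// Branch-free variant built from two fabs calls (sign is lost, but the
// result is squared anyway). Assumes |p2[i] - p1[i]| <= boxWidth.
double distanceSqFabs(const double* p1, const double* p2, double boxWidth) {
    assert(boxWidth > 0.0);
    const double half = boxWidth / 2.0;
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        const double t = std::fabs(std::fabs(p2[i] - p1[i]) - half) - half;
        sum += t * t;
    }
    return sum;
}

// Variant using std::round, the scalar analogue of the SSE rounding trick.
// Works for separations larger than boxWidth as well.
double distanceSqRound(const double* p1, const double* p2, double boxWidth) {
    assert(boxWidth > 0.0);
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        double d = p2[i] - p1[i];
        d -= boxWidth * std::round(d / boxWidth);
        sum += d * d;
    }
    return sum;
}
```

For points (0.1, 0.2, 0.3) and (1.7, 0.2, 0.3) in a box of width 1.8, all three return roughly 0.04, because the x separation of 1.6 wraps to 0.2. Which variant is fastest depends on the compiler and target, so it is worth timing them in the real simulation.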

Optimization method for finding floating status of an object

The problem to solve is finding the floating status of a floating body, given its weight and its center of gravity.
The function I use calculates the displaced volume and the center of buoyancy of the body, given sinkage, heel and trim,
where sinkage is a length and heel and trim are angles limited to values from -90 to 90.
The floating status is found when the displaced volume is equal to the weight and the center of gravity is on a vertical line with the center of buoyancy.
I have this implemented as a non-linear Newton-Raphson root-finding problem with 3 variables (sinkage, trim, heel) and 3 equations.
This method works, but needs good initial guesses. So I am hoping to find either a better approach for this, or a good method to find the initial values.
Below is the code for the newton and jacobian functions used for the Newton-Raphson iteration. The function volume takes the parameters sinkage, heel and trim, and returns the volume and the coordinates of the center of buoyancy.
I also included the absmax and GSolve2 algorithms; I believe these are taken from Numerical Recipes.
void jacobian(float x[], float weight, float vcg, float tcg, float lcg, float jac[][3], float f0[]) {
float h = 0.0001f;
float temp;
float j_volume, j_vcb, j_lcb, j_tcb;
float f1[3];
volume(x[0], x[1], x[2], j_volume, j_lcb, j_vcb, j_tcb);
f0[0] = j_volume-weight;
f0[1] = j_tcb-tcg;
f0[2] = j_lcb-lcg;
for (int i=0;i<3;i++) {
temp = x[i];
x[i] = temp + h;
volume(x[0], x[1], x[2], j_volume, j_lcb, j_vcb, j_tcb);
f1[0] = j_volume-weight;
f1[1] = j_tcb-tcg;
f1[2] = j_lcb-lcg;
x[i] = temp;
jac[0][i] = (f1[0]-f0[0])/h;
jac[1][i] = (f1[1]-f0[1])/h;
jac[2][i] = (f1[2]-f0[2])/h;
void newton(float weight, float vcg, float tcg, float lcg, float &sinkage, float &heel, float &trim) {
float x[3] = {10,1,1};
float accuracy = 0.000001f;
int ntryes = 30;
int i = 0;
float jac[3][3];
float max;
float f0[3];
float gauss_f0[3];
while (i < ntryes) {
jacobian(x, weight, vcg, tcg, lcg, jac, f0);
if (sqrt((f0[0]*f0[0]+f0[1]*f0[1]+f0[2]*f0[2])/2) < accuracy) {
gauss_f0[0] = -f0[0];
gauss_f0[1] = -f0[1];
gauss_f0[2] = -f0[2];
GSolve2(jac, 3, gauss_f0);
x[0] = x[0]+gauss_f0[0];
x[1] = x[1]+gauss_f0[1];
x[2] = x[2]+gauss_f0[2];
// absmax(x) - Return absolute max value from an array
max = absmax(x);
if (max < 1) max = 1;
if (sqrt((gauss_f0[0]*gauss_f0[0]+gauss_f0[1]*gauss_f0[1]+gauss_f0[2]*gauss_f0[2])) < accuracy*max) {
sinkage = x[0];
heel = x[1];
trim = x[2];
int GSolve2(float a[][3],int n,float b[]) {
float x,sum,max,temp;
int i,j,k,p,m,pos;
int nn = n-1;
for (k=0;k<=n-1;k++)
/* pivot*/
for (p=k;p<n;p++){
if (max < fabs(a[p][k])){
if (ABS(a[k][pos]) < EPS) {
writeLog("Matrix is singular");
if (pos != k) {
/* convert to upper triangular form */
if ( fabs(a[k][k])>=1.e-6)
for (i=k+1;i<n;i++)
x = a[i][k]/a[k][k];
for (j=k+1;j<n;j++) a[i][j] = a[i][j] -a[k][j]*x;
b[i] = b[i] - b[k]*x;
writeLog("zero pivot found in line:%d",k);
return 0;
/* back substitution */
b[nn] = b[nn] / a[nn][nn];
for (i=n-2;i>=0;i--)
sum = b[i];
for (j=i+1;j<n;j++)
sum = sum - a[i][j]*b[j];
b[i] = sum/a[i][i];
return 0;
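float absmax(float x[]) {
    // sizeof(x) on an array parameter gives the size of a pointer, not the
    // number of elements, so the length is fixed to the 3 unknowns used in newton.
    const int n = 3;
    float max = fabs(x[0]);
    for (int i = 1; i < n; i++) {
        if (max < fabs(x[i])) {
            max = fabs(x[i]);
        }
    }
    return max;
}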
Have you considered some stochastic search methods to find the initial value and then fine-tuning with Newton Raphson? One possibility is evolutionary computation, you can use the Inspyred package. For a physical problem similar in many ways to the one you describe, look at this example:
What about using a damped version of Newton's method? You could quite easily modify your implementation to support it. Think about Newton's method as finding a direction
d_k = f(x_k) / f'(x_k)
and updating the variable
x_k+1 = x_k - L_k d_k
In the usual Newton's method, L_k is always 1, but this might create overshoots or undershoots. So, let your method choose L_k. Suppose that your method usually overshoots. A possible strategy consists in taking the largest L_k in the set {1,1/2,1/4,1/8,... L_min} such that the condition
|f(x_k+1)| <= (1-L_k/2) |f(x_k)|
is satisfied (or L_min if none of the values satisfies this criterion).
With the same criterion, another possible strategy is to start with L_0=1 and, if the criterion is not met, try with L_0/2 until it works (or until L_0 = L_min). Then for L_1, start with min(1, 2L_0) and do the same. Then start with L_2=min(1, 2L_1) and so on.
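Here is a one-dimensional sketch of the first strategy, to show how little the iteration changes. The function name dampedNewton, the default L_min of 1/1024, the tolerance and the iteration limit are choices made for this example. For the three-variable problem, d_k would be the solution of the linear system already computed with GSolve2 and |f| would be a vector norm.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Damped Newton iteration in one dimension: take the largest step length L in
// {1, 1/2, 1/4, ..., lMin} with |f(x - L d)| <= (1 - L/2) |f(x)|.
double dampedNewton(const std::function<double(double)>& f,
                    const std::function<double(double)>& df,
                    double x, double lMin = 1.0 / 1024.0,
                    double tol = 1e-12, int maxIter = 50) {
    assert(lMin > 0.0 && lMin <= 1.0);
    for (int k = 0; k < maxIter; ++k) {
        const double fx = f(x);
        if (std::fabs(fx) < tol) break;      // converged
        const double d = fx / df(x);         // Newton direction d_k
        double L = 1.0;
        // Halve L until the decrease condition holds or L reaches lMin.
        while (L > lMin &&
               std::fabs(f(x - L * d)) > (1.0 - L / 2.0) * std::fabs(fx)) {
            L /= 2.0;
        }
        x -= L * d;
    }
    return x;
}
```

A standard test case is f(x) = atan(x) started at x = 3: the undamped method overshoots further on every step and diverges, while the damped version shortens the first step and then converges to the root at 0.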
By the way: are you sure that your problem has a unique solution? I guess that the answer to this question depends on the shape of your object. If you have a rugby ball, there's one angle that you cannot fix. So if your shape is close to such an object, I would not be surprised that the problem is difficult to solve for that angle.