Why is my gradient wrong (Coursera, Logistic Regression, Julia)?

I'm trying to do Logistic Regression from Coursera in Julia, but it doesn't work.
The Julia code to calculate the Gradient:
sigmoid(z) = 1 / (1 + e ^ -z)
hypotesis(theta, x) = sigmoid(scalar(theta' * x))
function gradient(theta, x, y)
    (m, n) = size(x)
    h = [hypotesis(theta, x[i,:]') for i in 1:m]
    g = Array(Float64, n, 1)
    for j in 1:n
        g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])
    end
    g
end
When this gradient is used, it produces the wrong results. I can't figure out why; the code looks right.
In the full Julia script, the optimal theta is calculated both with my gradient descent implementation and with the built-in Optim package, and the results are different.

The gradient is correct (up to a scalar multiple, as @roygvib points out). The problem is with the gradient descent.
If you look at the values of the cost function during your gradient descent, you will see a lot of NaN,
which probably come from the exponential:
lowering the step size (e.g., to 1e-5) will avoid the overflow,
but you will have to increase the number of iterations a lot (perhaps to 10_000_000).
A better (faster) solution would be to let the step size vary.
For instance, one could multiply the step size by 1.1
if the cost function improves after a step
(the optimum still looks far away in this direction: we can go faster),
and divide it by 2 if it does not (we went too fast and ended up past the minimum).
One could also do a line search in the direction of the gradient to find the best step size
(but this is time-consuming and can be replaced by approximations, e.g., Armijo's rule).
Rescaling the predictive variables also helps.
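To make the varying step size concrete, here is a minimal Python sketch (not the OP's Julia script). The toy data, the starting step of 0.1 and the function names are assumptions made up for illustration. It also rejects a step that makes the cost worse, which is a small variation on the rule described above. The sigmoid is written so that it never exponentiates a large positive number, which avoids the overflow that produces NaN.

```python
import math

def sigmoid(z):
    # Stable logistic function: exp() only ever sees a non-positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(theta, xs, ys):
    # Mean cross-entropy; h is clipped away from 0 and 1 to avoid log(0).
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        h = min(max(h, 1e-12), 1.0 - 1e-12)
        total -= y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return total / len(xs)

def gradient(theta, xs, ys):
    # Gradient of the mean cross-entropy (includes the division by m).
    m = len(xs)
    g = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / m
    return g

def adaptive_descent(theta, xs, ys, alpha=0.1, n_iterations=2000):
    c = cost(theta, xs, ys)
    for _ in range(n_iterations):
        g = gradient(theta, xs, ys)
        trial = [t - alpha * gj for t, gj in zip(theta, g)]
        c_trial = cost(trial, xs, ys)
        if c_trial < c:
            theta, c = trial, c_trial   # improvement: keep the step, go faster
            alpha *= 1.1
        else:
            alpha /= 2.0                # overshoot: discard the step, slow down
    return theta, c

# Hypothetical, non-separable toy data; the first feature is the intercept.
xs = [(1, 0.5), (1, 1.5), (1, 2.0), (1, 3.0), (1, 3.5), (1, 4.5)]
ys = [0, 0, 1, 0, 1, 1]
theta_opt, c_opt = adaptive_descent([0.0, 0.0], xs, ys)
```

Because a failed step is thrown away, the cost can never increase from one iteration to the next, so a step size that is too large at the start only costs a few wasted evaluations.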

I tried comparing gradient() in the OP's code with numerical derivative of cost_j() (which is the objective function of minimization) using the following routine
function grad_num( theta, x, y )
    g = zeros( 3 )
    eps = 1.0e-6
    disp = zeros( 3 )
    for k = 1:3
        # central difference in the k-th component of theta
        disp[:] = theta[:]
        disp[ k ]= theta[ k ] + eps
        plus = cost_j( disp, x, y )
        disp[ k ]= theta[ k ] - eps
        minus = cost_j( disp, x, y )
        g[ k ] = ( plus - minus ) / ( 2.0 * eps )
    end
    return g
end
But the gradient values obtained from the two routines do not seem to agree very well (at least for the initial stage of minimization)... So I manually derived the gradient of cost_j( theta, x, y ), from which it seems that the division by m is missing:
#/ OP's code
# g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] )
#/ modified code
g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] ) / m
Because I am not very sure if the above code and expression are really correct, could you check them by yourself...?
But in fact, regardless of whether I use the original or corrected gradients, the program converges to the same minimum value (0.2034977016, almost the same as obtained from Optim), because the two gradients differ only by a multiplicative factor! Because the convergence was very slow, I also modified the stepsize alpha adaptively following the suggestion by Vincent (here I used more moderate values for acceleration/deceleration):
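function gradient_descent(x, y, theta, alpha, n_iterations)
    c = cost_j( theta, x, y )
    for i = 1:n_iterations
        c_prev = c
        c = cost_j( theta, x, y )
        if c - c_prev < 0.0
            alpha *= 1.01    # cost decreased: accelerate
        else
            alpha /= 1.05    # cost did not decrease: slow down
        end
        theta[:] = theta - alpha * gradient(theta, x, y)
    end
    theta, c    # assumed return value: a tuple, so that [ 1 ] in the call below selects theta
end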
and called this routine as
optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ]
The variation of cost_j versus iteration steps is plotted below.


Converting Cartesian image to polar, appearance differences

I'm trying to do a polar transform on the first image below and end up with the second. However my result is the third image. I have a feeling it has to do with what location I choose as my "origin" but am unsure.
radius = sqrt(width**2 + height**2)
nheight = int(ceil(radius)/2)
nwidth = int(ceil(radius/2))
for y in range(0, height):
    for x in range(0, width):
        t = int(atan(y/x))
        r = int(sqrt(x**2+y**2)/2)
        color = getColor(getPixel(pic, x, y))
        setColor( getPixel(radial,r,t), color)
There are a few differences / errors:
They use the centre of the image as the origin
They scale the axis appropriately. In your example, you're plotting your angle (between 0 and in your case, pi), instead of utilising the full height of the image.
You're using the wrong atan function (atan2 works a lot better in this situation :))
Not amazingly important, but you're rounding unnecessarily quite a lot, which throws off accuracy a little and can slow things down.
This is the code combining my suggested improvements. It's not massively efficient, but it should hopefully work :)
maxradius = sqrt(width**2 + height**2)/2
rscale = width / maxradius
tscale = height / (2*math.pi)
for y in range(0, height):
    dy = y - height/2
    for x in range(0, width):
        dx = x - width/2
        t = atan2(dy,dx)%(2*math.pi)
        r = sqrt(dx**2+dy**2)
        color = getColor(getPixel(pic, x, y))
        setColor( getPixel(radial,int(r*rscale),int(t*tscale)), color)
In particular, it fixes the above problems in the following ways:
We use dx = x - width / 2 as a measure of distance from the centre, and similarly with dy. We then use these in place of x, y throughout the computation.
We will have our r satisfying 0 <= r <= sqrt( (width/2)^2 + (height/2)^2 ), and our t eventually satisfying 0 <= t < 2 pi, so I create the appropriate scale factors to put r and t along the x and y axes respectively.
Normal atan can only distinguish based on gradients, and is computationally unstable near vertical lines. Instead, atan2 (see http://en.wikipedia.org/wiki/Atan2) solves both problems, and accepts (y, x) pairs to give an angle. atan2 returns an angle -pi < t <= pi, so we take the remainder modulo 2 * math.pi to bring it into the range 0 <= t < 2 pi, ready for scaling.
I've only rounded at the end, when the new pixels get set.
Any questions, just ask!
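For anyone who wants to test the mapping without the getPixel / setColor helpers, here is a small standalone sketch of the same coordinate transform. The function name polar_coords and the clamping with min() are my own additions: the clamp guards against the corner pixels (r equal to the maximum radius) and against rounding in the modulo landing exactly on the image edge.

```python
import math

def polar_coords(x, y, width, height):
    """Map source pixel (x, y) to (column, row) in a polar image of the same size."""
    dx = x - width / 2            # offsets from the image centre
    dy = y - height / 2
    max_radius = math.hypot(width, height) / 2
    r = math.hypot(dx, dy)
    t = math.atan2(dy, dx) % (2 * math.pi)   # angle in [0, 2*pi)
    # Scale r to the horizontal axis and t to the vertical axis,
    # rounding only here and clamping to the last valid index.
    col = min(int(r * width / max_radius), width - 1)
    row = min(int(t * height / (2 * math.pi)), height - 1)
    return col, row
```

With a 100 x 100 image, the centre pixel (50, 50) lands in column 0, and the top-left corner (0, 0) lands in the last column, since it is as far from the centre as a pixel can be.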

How do I encode Manhattan distance in Mixed Integer Programming

Let's have two points, (x1, y1) and (x2, y2):
dx = |x1 - x2|
dy = |y1 - y2|
D_manhattan = dx + dy where dx,dy >= 0
I am a bit stuck on how to make x1 - x2 positive for |x1 - x2|. Presumably I introduce a binary variable representing the sign, but I am not allowed to multiply a sign switch by x1 - x2, as they are all unknown variables and that would result in a quadratic term.
If you are minimizing an increasing function of |x| (or maximizing a decreasing function, of course),
you can always have the absolute value of any quantity x in an LP as a variable absx such that:
absx >= x
absx >= -x
It works because the value absx will 'tend' to its lower bound, so it will either reach x or -x.
On the other hand, if you are minimizing a decreasing function of |x|, your problem is not convex and cannot be modelled as an LP.
For all these kinds of questions, it would be much better to add a simplified version of your problem with the objective, as this is often useful for all those modelling techniques.
What I meant is that there is no general solution to this kind of problem: you cannot in general represent an absolute value in a linear problem, although in practical cases it is often possible.
For example, consider the problem:
max y
y <= | x |
-1 <= x <= 2
0 <= y
It is bounded and has an obvious solution (2, 2), but it cannot be modelled as an LP because the domain is not convex (it looks like the shape under an 'M' if you draw it).
Without your model, it is not possible to answer the question I'm afraid.
Edit 2
I am sorry, I did not read the question correctly. If you can use binary variables and if all your distances are bounded by some constant M, you can do something.
We use:
a continuous variable ax to represent the absolute value of the quantity x
a binary variable sx to represent the sign of x (sx = 1 if x >= 0)
Those constraints are always verified if x < 0, and enforce ax = x otherwise:
ax <= x + M * (1 - sx)
ax >= x - M * (1 - sx)
Those constraints are always verified if x >= 0, and enforce ax = -x otherwise:
ax <= -x + M * sx
ax >= -x - M * sx
This is a variation of the "big M" method that is often used to linearize quadratic terms. Of course you need to have an upper bound of all the possible values of x (which, in your case, will be the value of your distance: this will typically be the case if your points are in some bounded area)
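A quick way to convince yourself that the four big-M constraints pin ax to |x| is to brute-force them on small integers. The sketch below does that in plain Python without any solver. Two things in it are my own assumptions and not part of the constraints listed above: the extra bound ax >= 0 (without it, the wrong value of sx would allow ax = -|x|), and the choice M = 2 * max|x| (with a smaller M the "switched off" pair of constraints can cut off the correct value).

```python
def feasible_abs_values(x, M, candidates):
    """Return every ax in candidates that satisfies the constraints for some binary sx."""
    found = set()
    for sx in (0, 1):
        for ax in candidates:
            ok = (
                ax >= 0                        # assumption: added bound, see the note above
                and ax <= x + M * (1 - sx)     # active when sx = 1: ax <= x
                and ax >= x - M * (1 - sx)     # active when sx = 1: ax >= x
                and ax <= -x + M * sx          # active when sx = 0: ax <= -x
                and ax >= -x - M * sx          # active when sx = 0: ax >= -x
            )
            if ok:
                found.add(ax)
    return found

# x ranges over [-5, 5], so M = 10 is twice the largest |x|.
results = {x: feasible_abs_values(x, 10, range(-20, 21)) for x in range(-5, 6)}
```

Every entry of results is the single-element set {abs(x)}. Repeating the check with M = 5 leaves x = 5 with no feasible ax at all, which is why M has to be chosen generously.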

Newton Raphson with SSE2 - can someone explain me these 3 lines

I'm reading this document: http://software.intel.com/en-us/articles/interactive-ray-tracing
and I stumbled upon these three lines of code:
The SIMD version is already quite a bit faster, but we can do better.
Intel has added a fast 1/sqrt(x) function to the SSE2 instruction set.
The only drawback is that its precision is limited. We need the
precision, so we refine it using Newton-Raphson:
__m128 nr = _mm_rsqrt_ps( x );
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr );
result = _mm_mul_ps( _mm_mul_ps( half, nr ), _mm_sub_ps( three, muls ) );
This code assumes the existence of a __m128 variable named 'half'
(four times 0.5f) and a variable 'three' (four times 3.0f).
I know how to use Newton Raphson to calculate a function's zero and I know how to use it to calculate the square root of a number but I just can't see how this code performs it.
Can someone explain it to me please?
Given the Newton iteration y_(n+1) = y_n * (3 - x * y_n^2) / 2, it should be quite straightforward to see this in the source code.
__m128 nr = _mm_rsqrt_ps( x ); // The initial approximation y_0
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr ); // muls = x*nr*nr == x(y_n)^2
result = _mm_mul_ps(
    _mm_sub_ps( three, muls ),   // this is 3.0 - muls
    _mm_mul_ps( half, nr ) );    // multiplied by y_0 / 2, i.e. y_0 * 0.5
And to be precise, this algorithm is for the inverse square root.
Note that this still doesn't give a fully accurate result. rsqrtps with an NR iteration gives almost 23 bits of accuracy, vs. sqrtps's 24 bits with correct rounding for the last bit.
The limited accuracy is an issue if you want to truncate the result to an integer: (int)4.99999 is 4. Also, watch out for the x == 0.0 case if using sqrt(x) ~= x * rsqrt(x), because 0 * +Inf = NaN.
To compute the inverse square root of a, Newton's method is applied to the equation 0=f(x)=a-x^(-2) with derivative f'(x)=2*x^(-3) and thus the iteration step
N(x) = x - f(x)/f'(x) = x - (a*x^3-x)/2
= x/2 * (3 - a*x^2)
This division-free method has -- in contrast to the globally converging Heron's method -- a limited region of convergence, so you need an already good approximation of the inverse square root to get a better approximation.
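The same iteration step is easy to try in scalar form. Below is a small Python sketch (the function name and the starting values are made up for illustration; the SSE code gets its starting value from rsqrtps instead). It shows both the fast convergence from a decent starting guess and the limited region of convergence mentioned above.

```python
def rsqrt_newton(a, y0, steps=1):
    """Refine an approximation y0 of 1/sqrt(a) with Newton steps y <- y/2 * (3 - a*y*y)."""
    y = y0
    for _ in range(steps):
        y = 0.5 * y * (3.0 - a * y * y)
    return y

good_1 = rsqrt_newton(4.0, 0.49)           # one step from a 2% error
good_3 = rsqrt_newton(4.0, 0.49, steps=3)  # three steps
bad = rsqrt_newton(4.0, 2.0)               # starting guess far too large
```

For a = 4 the exact answer is 0.5. Starting from 0.49, a single step already brings the error below 1e-3, and each further step roughly squares it. Starting from 2.0, the first step jumps to a negative value, so the iteration never reaches the positive root.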

How to fit the 2D scatter data with a line with C++

I used to work with MATLAB, and for the question I raised I can use p = polyfit(x,y,1) to estimate the best fit line for the scatter data in a plate. I was wondering which resources I can rely on to implement the line fitting algorithm with C++. I understand there are a lot of algorithms for this subject, and for me I expect the algorithm should be fast and meantime it can obtain the comparable accuracy of polyfit function in MATLAB.
This page describes the algorithm easier than Wikipedia, without extra steps to calculate the means etc. : http://faculty.cs.niu.edu/~hutchins/csci230/best-fit.htm . Almost quoted from there, in C++ it's:
#include <vector>
#include <cmath>

struct Point {
    double _x, _y;
};

struct Line {
    double _slope, _yInt;

    double getYforX(double x) {
        return _slope*x + _yInt;
    }

    // Construct line from points
    bool fitPoints(const std::vector<Point> &pts) {
        int nPoints = pts.size();
        if( nPoints < 2 ) {
            // Fail: infinitely many lines passing through this single point
            return false;
        }
        double sumX=0, sumY=0, sumXY=0, sumX2=0;
        for(int i=0; i<nPoints; i++) {
            sumX += pts[i]._x;
            sumY += pts[i]._y;
            sumXY += pts[i]._x * pts[i]._y;
            sumX2 += pts[i]._x * pts[i]._x;
        }
        double xMean = sumX / nPoints;
        double yMean = sumY / nPoints;
        double denominator = sumX2 - sumX * xMean;
        // You can tune the eps (1e-7) below for your specific task
        if( std::fabs(denominator) < 1e-7 ) {
            // Fail: it seems a vertical line
            return false;
        }
        _slope = (sumXY - sumX * yMean) / denominator;
        _yInt = yMean - _slope * xMean;
        return true;
    }
};
Please, be aware that both this algorithm and the algorithm from Wikipedia ( http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line ) fail in case the "best" description of points is a vertical line. They fail because they use
y = k*x + b
line equation, which intrinsically cannot describe vertical lines. If you also want to cover the cases where the data points are "best" described by a vertical line, you need a line fitting algorithm that uses
A*x + B*y + C = 0
line equation. You can still modify the current algorithm to produce that equation:
y = k*x + b <=>
y - k*x - b = 0 <=>
B=1, A=-k, C=-b
In terms of the above code:
B=1, A=-_slope, C=-_yInt
And in "then" block of the if checking for denominator equal to 0, instead of // Fail: it seems a vertical line, produce the following line equation:
x = xMean <=>
x - xMean = 0 <=>
A=1, B=0, C=-xMean
I've just noticed that the original article I was referring to has been deleted. And this web page proposes a little different formula for line fitting: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html
double denominator = sumX2 - 2 * sumX * xMean + nPoints * xMean * xMean;
_slope = (sumXY - sumY*xMean - sumX * yMean + nPoints * xMean * yMean) / denominator;
The formulas are identical because nPoints*xMean == sumX and nPoints*xMean*yMean == sumX * yMean == sumY * xMean.
I would suggest coding it from scratch. It is a very simple implementation in C++. You can code up both the intercept and gradient for least-squares fit (the same method as polyfit) from your data directly from the formulas here
These are closed form formulas that you can easily evaluate yourself using loops. If you were using higher degree fits then I would suggest a matrix library or more sophisticated algorithms but for simple linear regression as you describe above this is all you need. Matrices and linear algebra routines would be overkill for such a problem (in my opinion).
Equation of line is Ax + By + C=0.
So it can easily be converted (when B is not too close to zero) to y = (-A/B)*x + (-C/B)
#include <array>
#include <cmath>
#include <utility>
#include <vector>

typedef double scalar_type;
typedef std::array< scalar_type, 2 > point_type;
typedef std::vector< point_type > cloud_type;

bool fit( scalar_type & A, scalar_type & B, scalar_type & C, cloud_type const& cloud )
{
    if( cloud.size() < 2 ){ return false; }
    scalar_type X=0, Y=0, XY=0, X2=0, Y2=0;
    for( auto const& point: cloud )
    { // Do all calculation symmetric regarding X and Y
        X += point[0];
        Y += point[1];
        XY += point[0] * point[1];
        X2 += point[0] * point[0];
        Y2 += point[1] * point[1];
    }
    X /= cloud.size();
    Y /= cloud.size();
    XY /= cloud.size();
    X2 /= cloud.size();
    Y2 /= cloud.size();
    A = - ( XY - X * Y ); //!< Common for both solution
    scalar_type Bx = X2 - X * X;
    scalar_type By = Y2 - Y * Y;
    if( std::fabs( Bx ) < std::fabs( By ) ) //!< Test verticality/horizontality
    { // Line is more Vertical.
        B = By;
        std::swap( A, B ); // x is fitted as a function of y, so A and B exchange roles
    }
    else
    { // Line is more Horizontal.
        // Classical solution, when we expect more horizontal-like line
        B = Bx;
    }
    C = - ( A * X + B * Y );
    //Optional normalization:
    // scalar_type D = sqrt( A*A + B*B );
    // A /= D;
    // B /= D;
    // C /= D;
    return true;
}
You can also use or go over this implementation; there is also documentation here.
Fitting a line can be accomplished in different ways.
Least squares means minimizing the sum of the squared distances.
But you could take another cost function, for example the (not squared) distance. Normally, though, you use the squared distance (least squares).
There is also the possibility of defining the distance in different ways. Normally you just use the distance along the y-axis. But you could also use the total/orthogonal distance, where the distance is calculated in both the x- and y-directions. This can be a better fit if you also have errors in the x direction (say x is the time of measurement) and you didn't start the measurement at the exact time you saved in the data. For both the least squares and the total least squares line fit, closed-form algorithms exist. So if you fit with one of those, you will get the line with the minimal sum of squared distances to the data points. You can't fit a better line in the sense of your definition. You could only change the definition, for example by taking another cost function or by defining distance in another way.
There is a lot more about fitting models to data that you could think of, but normally they all use the least squares line fit, and you should be fine most of the time. But if you have a special case, it can be necessary to think about what you're doing. A least squares fit is done in maybe a few minutes. Thinking about which method best fits your problem involves understanding the math, which can take an indefinite amount of time :-).
Note: This answer is NOT AN ANSWER TO THIS QUESTION but to this one "Line closest to a set of points" that has been flagged as "duplicate" of this one (incorrectly in my opinion), no way to add new answers to it.
The question asks for:
Find the line whose distance from all the points is minimum ? By
distance I mean the shortest distance between the point and the line.
The most usual interpretation of distance "between the point and the line" is the euclidean distance and the most common interpretation of "from all points" is the sum of distances (in absolute or squared value).
When the target is to minimize the sum of squared euclidean distances, linear regression (LST) is not the algorithm to use. In addition, linear regression cannot result in a vertical line. The algorithm to use is "total least squares". See for example Wikipedia for the problem description and this answer on Math Stack Exchange for details about the formulation.
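For the 2D case, total least squares has a closed form: the line passes through the centroid and points along the direction of largest scatter. Here is a minimal Python sketch of that, assuming the points come as (x, y) pairs; the function name fit_line_tls is made up. It returns the line as A*x + B*y + C = 0 with a unit normal (A, B), so vertical lines need no special case.

```python
import math

def fit_line_tls(points):
    """Fit A*x + B*y + C = 0 minimizing the sum of squared perpendicular distances.

    Returns (A, B, C) with A*A + B*B == 1, or None for fewer than 2 points.
    """
    n = len(points)
    if n < 2:
        return None
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # Angle of the principal axis of the 2x2 scatter matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    a, b = -math.sin(theta), math.cos(theta)   # unit normal to that axis
    c = -(a * mx + b * my)                     # the line passes through the centroid
    return a, b, c

vertical = fit_line_tls([(2, 0), (2, 1), (2, 5)])
diagonal = fit_line_tls([(0, 0), (1, 1), (2, 2)])
```

For the three points on x = 2 the result has B close to zero and -C/A close to 2, which is exactly the vertical line that y = k*x + b cannot express. If all points coincide, or the scatter is the same in every direction, the best line is not unique and this sketch simply returns one of the candidates.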
to fit a line y=param[0]x+param[1] simply do this:
// loop over data:
sum_x += x[i];
sum_y += y[i];
sum_xy += x[i] * y[i];
sum_x2 += x[i] * x[i];
// means
double mean_x = sum_x / ninliers;
double mean_y = sum_y / ninliers;
float varx = sum_x2 - sum_x * mean_x;
float cov = sum_xy - sum_x * mean_y;
// check for zero varx
param[0] = cov / varx;
param[1] = mean_y - param[0] * mean_x;
More on the topic http://easycalculation.com/statistics/learn-regression.php
(the formulas are the same; they are just multiplied and divided by N, the sample size). If you want to fit a plane to 3D data use a similar approach -
Disclaimer: all quadratic fits are linear and optimal in the sense that they reduce the noise in the parameters. However, you might be interested in reducing the noise in the data instead. You might also want to ignore outliers, since they can bias your solution greatly. Both problems can be solved with RANSAC. See my post at:

Probability density function from a paper, implemented using C++, not working as intended

So I'm implementing a heuristic algorithm, and I've come across this function.
I have an array of 1 to n (0 to n-1 in C, whatever). I want to choose a number of elements that I'll copy to another array. Given a parameter y (0 < y <= 1), I want to have a distribution of numbers whose average is (y * n). That means that whenever I call this function, it gives me a number between 0 and n, and the average of these numbers is y*n.
According to the author, "l" is a random number: 0 < l < n. In my test code it's currently generating 0 <= l <= n. I had the right code, but I've been messing with this for hours now, and I'm too lazy to code it back.
So I coded the first part of the function, for y <= 0.5.
I set y to 0.2 and n to 100. That means it had to return a number between 0 and 99, with average 20.
But the results aren't between 0 and n; they are small floats, and the bigger n is, the smaller the float.
This is the C test code. "x" is the "l" parameter.
int n = 100;
float y = 0.2;
float n_copy;
for(int i = 0 ; i < 20 ; i++)
{
    float x = (float) (rand()/(float)RAND_MAX); // 0 <= x <= 1
    x = x * n; // 0 <= x <= n
    float p1 = (1 - y) / (n*y);
    float p2 = (1 - ( x / n ));
    float exp = (1 - (2*y)) / y;
    p2 = pow(p2, exp);
    n_copy = p1 * p2;
    printf("%.5f\n", n_copy);
}
And here are some results (5 decimals truncated):
The article is:
pages 6 and 7.
or search "cAS: cunning ant system" on google.
So what am I doing wrong? I don't believe the author is wrong, because there are more than 5 papers describing this same function.
All my internets to whoever helps me. This is important to my work.
Thanks :)
You may misunderstand what is expected of you.
Given a (properly normalized) PDF, and wanting to throw a random distribution consistent with it, you form the Cumulative Distribution Function (CDF) by integrating the PDF, then invert the CDF, and use a uniform random variate as the argument of the inverted function.
A little more detail.
f_s(l) is the PDF, and has been normalized on [0,n).
Now you integrate it to form the CDF
g_s(l') = \int_0^{l'} dl f_s(l)
Note that this is a definite integral to an unspecified endpoint which I have called l'. The CDF is accordingly a function of l'. Assuming we have the normalization right, g_s(n) = 1.0. If this is not so we apply a simple coefficient to fix it.
Next invert the CDF and call the result G^{-1}(x). For this you'll probably want to choose a particular value of gamma.
Then throw uniform random numbers on [0,1), and use those as the argument, x, to G^{-1}. The result should lie in [0,n), and should be distributed according to f_s.
Like Justin said, you can use a computer algebra system for the math.
dmckee is actually correct, but I thought that I would elaborate more and try to explain away some of the confusion here. I could definitely fail. f_s(l), the function you have in your pretty formula above, is the probability distribution function. It tells you, for a given input l between 0 and n, the probability that l is the segment length. The sum (integral) for all values between 0 and n should be equal to 1.
The graph at the top of page 7 confuses this point. It plots l vs. f_s(l), but you have to watch out for the stray factors it puts on the side. You notice that the values on the bottom go from 0 to 1, but there is a factor of x n on the side, which means that the l values actually go from 0 to n. Also, on the y-axis there is a x 1/n which means these values don't actually go up to about 3, they go to 3/n.
So what do you do now? Well, you need to solve for the cumulative distribution function by integrating the probability distribution function over l, which actually turns out to be not too bad (I did it with the Wolfram Mathematica Online Integrator by using x for l and using only the equation for y <= .5). That, however, was using an indefinite integral, and you are really integrating along x from 0 to l. If we set the resulting equation equal to some variable (z for instance), the goal now is to solve for l as a function of z. z here is a random number between 0 and 1. You can try using a symbolic solver for this part if you would like (I would). Then you have not only achieved your goal of being able to pick random ls from this distribution, you have also achieved nirvana.
A little more work done
I'll help a little bit more. I tried doing what I said above for y <= .5, but the symbolic algebra system I was using wasn't able to do the inversion (some other system might be able to). However, then I decided to try using the equation for .5 < y <= 1. This turns out to be much easier. If I change l to x in f_s(l) I get
y / n / (1 - y) * (x / n)^((2 * y - 1) / (1 - y))
Integrating this over x from 0 to l I got (using Mathematica's Online Integrator):
(l / n)^(y / (1 - y))
It doesn't get much nicer than that with this sort of thing. If I set this equal to z and solve for l I get:
l = n * z^(1 / y - 1) for .5 < y <= 1
One quick check is for y = 1. In this case, we get l = n no matter what z is. So far so good. Now, you just generate z (a random number between 0 and 1) and you get an l that is distributed as you desired for .5 < y <= 1. But wait, looking at the graph on page 7 you notice that the probability distribution function is symmetric. That means that we can use the above result to find the value for 0 < y <= .5. We just change l -> n-l and y -> 1-y and get
n - l = n * z^(1 / (1 - y) - 1)
l = n * (1 - z^(1 / (1 - y) - 1)) for 0 < y <= .5
Anyway, that should solve your problem unless I made some error somewhere. Good luck.
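To check the two inverted formulas, here is a small Python sketch of the sampler (the function name sample_length and the optional rng argument are made up for illustration). It draws z uniformly from [0, 1) and plugs it into the expression for l, so the output is the segment length itself, not the value of the density.

```python
import random

def sample_length(n, y, rng=random):
    """Draw l in [0, n] with mean y * n by inverting the CDF derived above."""
    z = rng.random()                          # uniform on [0, 1)
    if y > 0.5:
        return n * z ** (1.0 / y - 1.0)
    return n * (1.0 - z ** (1.0 / (1.0 - y) - 1.0))

rng = random.Random(0)                        # fixed seed so the run is repeatable
low = [sample_length(100, 0.2, rng) for _ in range(100000)]
high = [sample_length(100, 0.8, rng) for _ in range(100000)]
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
```

With n = 100, the sample mean comes out close to 20 for y = 0.2 and close to 80 for y = 0.8, and every sample lies between 0 and n. That is the behaviour the question was after, and it is what evaluating f_s(l) directly cannot give you.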
Given that for any values l, y, n as described, the terms you call p1 and p2 are both in [0,1) and exp is in [1,..) making pow(p2, exp) also in [0,1) thus I don't see how you'd ever get an output with the range [0,n)