Solving Gaussian Process posterior analytically in PyMC3 - pymc3

I'm used to doing a Gaussian Process regression in GPFlow which lets you do this to solve for the posterior analytically:
import gpflow as gp
from gpflow.kernels import RBF, White, Periodic, Linear
k = RBF(x.shape[1]) + White(x.shape[1])
m = gp.models.GPR(x, y, k)
self.model = m
m.compile()
opt = gp.train.ScipyOptimizer()
opt.minimize(m)
I've recently moved to PyMC3 and am trying to accomplish the same thing as above. I've found in the documentation this bit of code (https://docs.pymc.io/notebooks/GP-slice-sampling.html#Examine-actual-posterior-distribution):
# Analytically compute posterior mean
L = np.linalg.cholesky(K_noise.eval())
alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))
post_mean = np.dot(K_s.T.eval(), alpha)
Ultimately I want to be using the GP for a regression on unseen data. Is using np.linalg the right way to analytically solve a GP posterior?

Sure. As stated in the tutorial, they implemented Algorithm 2.1 from Rasmussen's GPML, and he explicitly uses left matrix divide notation (\), which indicates to use a linear solve. For example, in theory (i.e., real number system),
A\b === A^(-1) * b === x
where x solves A*x = b. But in a practical computational domain (e.g., IEEE floating point), this equivalence breaks down because solve(A, b) is faster and more numerically stable than inv(A) * b.
The left matrix divide (\) notation is commonplace in Numerical Linear Algebra, and I'd venture that the most salient explanation for its preference is that it tacitly reminds students to never compute a matrix inverse, unnecessarily.

Related

Maximizing a Ratio/Percent

I'm using cvxpy to model a problem.
Inside a very large and complex LP, I create two continuous, affine (unconstrained) expressions:x and y.
Due to how they are created, I know that 0 < x < y <= U. Obviously: x/y < 1.
In my LP objective, how do I maximize the ratio x/y?
Things I tried:
Maximizing x*cp.inv_pos(Y) states my problem is non DCP (also if I try to minimize the inverse)
I found various LP formulations for maximizing ratios (e.g. here or here) but these requires rewriting the constraints on all the terms in the expressions for x - I have no idea how to do that with cvxpy.
If this is the way to go then an example would be very helpful!

Matching features of images using PCA-SIFT

I want to match features in two images to detect copy-move forgery. I used the PCA-SIFT code to detect image features. But, I am having trouble in matching the PCA-SIFT features. According to several papers, similar matching process is used for PCA-SIFT as is used in SIFT. I have used the following code snippet to match features.
%des1 and des2 are the PCA-SIFT descriptors obtained from two images
% Precompute matrix transpose
des2t = des2';
matchTable = zeros(1,size(des1,1));
cnt=0; %no. of matches
%ration of ditances
distRatio = 0.5;
%normalising features
m1=max(max(des1));
m2=max(max(des2));
m=max(m1,m2);
des1=des1./m;
des2=des2./m;
for i = 1 : size(des1,1)
%finding eucledian distance of a vector in one image to all features in second image
A=des1(i,:);
D = des2-repmat(A,size(des2,1),1);
[vals,indx] = sort((sum(D.^2,2)).^(1/2)); %sort distances
% Check if nearest neighbor has angle less than distRatio times 2nd.
if (vals(1) < distRatio * vals(2))
matchTable(i) = indx(1);
cnt=cnt+1;
else
matchTable(i) = 0;
end
end
cnt
The above code works fine for SIFT features. But I am not able to get correct results for PCA-SIFT features even after trying several values of distRatio(0-1). I'm also not sure if the matlab central code for PCA-SIFT(mentioned above) does the exact process as mentioned in this paper
If somebody has any idea about the above problem then please comment.
The problem is, PCA does not preserve euclidean distance between 2 vectors. Take a simple example where your data is along the line y = x. The distance between 2 points along the line will depend on both co-ordinates, even if all your data is 1 dimensional, i.e. lying along the line. When you apply PCA, the new euclidean distance will only take the principle component into account, which would be the line y=x, so distance between (1,1), (2,2) would just be 1 instead of sqrt(2).
However, if you normalize the features by their euclidean norm, nearest neighbor using euclidean distance is equivalent to computing cosine-similarity (dot-product) between features.
https://en.wikipedia.org/wiki/Cosine_similarity
Therefore I would first recommend you to test if matching for sift features works if you normalize them by their L2 norm. If yes, you could apply PCA on those features, again normalize the PCA features by their L2 norm and then compute euclidean distance. As far as I remember, L2 norm of a sift vector is 1. So, you only need normalize your PCA-SIFT features by their L2 norm and compute euclidean distance.

Precomputed distances for spectral clustering with scikit-learn

I'm struggling to make sense of the spectral clustering documentation here.
Specifically.
If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements, and high values means very dissimilar elements, it can be transformed in a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:
np.exp(- X ** 2 / (2. * delta ** 2))
For my data, I have a complete distance matrix of size (n_samples, n_samples) where large entries represent dissimilar pairs, small values represent similar pairs and zero represents identical entries. (I.e. the only zeros are along the diagonal).
So all I need to do is build the SpectralClustering object with affinity = "precomputed" and then pass the transformed distance matrix to fit_predict.
I'm stuck on the suggested transformation equation. np.exp(- X ** 2 / (2. * delta ** 2)).
What is X here? The (n_samples, n_samples) distance matrix?
If so, what is delta. Is it just X.max()-X.min()?
Calling np.exp(- X ** 2 / (2. * (X.max()-X.min()) ** 2)) seems to do the right thing. I.e. big entries become relatively small, and small entries relatively big, with all the entries between 0 and 1. The diagonal is all 1's, which makes sense, since each point is most affine with itself.
But I'm worried. I think if the author had wanted me to use np.exp(- X ** 2 / (2. * (X.max()-X.min()) ** 2)) he would have told me to use just that, instead of throwing delta in there.
So I guess my question is just this. What's delta?
Yes, X in this case is the matrix of distances. delta is a scale parameter that you can tune as you wish. It controls the "tightness", so to speak, of the distance/similarity relation, in the sense that a small delta increases the relative dissimmilarity of faraway points.
Notice that delta is proportional to the inverse of the gamma parameter of the RBF kernel, mentioned earlier in the doc link you give: both are free parameters which can be used to tune the clustering results.

How to prevent Gurobi from converting a Max into a Min prob?

I am using Gurobi (via C++) as part of my MSc thesis to solve Quadratic Knapsack Problem instances. So far, I was able to generate a model with binary decision variables, quadratic objective function and the capacity constraint and Gurobi solved it just fine. Then I wanted to solve the continuous relaxation of the QKP. I built the model as before but with continuous variables instead of binary ones and Gurobi threw me an exception when I tried to optimize it:
10020 - Objective Q not PSD (negative diagonal entry)
Which confounded me for a bit since all values form the problem instance are ≥0. In preparing to post this question, I wrote both models out to file and discovered the reason:
NAME (null)
* Max problem is converted into Min one
Which of course means that all previously positive values are now negative. Now I know why Q is not PSD but how do I fix this? Can I prevent the conversion from a Max problem into a Min one? Do I need to configure the model for the continuous relaxation differently?
From my (unexperienced) point of view it just looks like Gurobi shot itself in the foot.
When you are maximizing a quadratic objective with Gurobi or any other convex optimizer, your 'Q' matrix has to be negative semi-definite, and when you are minimizing, your 'Q' matrix needs to be positive definite. Changing the sign and the sense of the objective changes nothing.
Gurobi doesn't verify that your problem is convex, but it will report any non-convexity it finds. The fact that your original problem seemed to solve as a MIP is an accident and you shouldn't rely on it.
You should model a quadratic objective with binary variables as a linear problem with some simple transformations. If x and y are binary, the expression x*y can be changed to z if you add the constraints
z <= x
z <= y
z >= x + y - 1

My neural net learns sin x but not cos x

I have build my own neural net and I have a weird problem with it.
The net is quite a simple feed-forward 1-N-1 net with back propagation learning. Sigmoid is used as activation function.
My training set is generated with random values between [-PI, PI] and their [0,1]-scaled sine values (This is because the "Sigmoid-net" produces only values between [0,1] and unscaled sine -function produces values between [-1,1]).
With that training-set, and the net set to 1-10-1 with learning rate of 0.5, everything works great and the net learns sin-function as it should. BUT.. if I do everything exately the same way for COSINE -function, the net won't learn it. Not with any setup of hidden layer size or learning rate.
Any ideas? Am I missing something?
EDIT: My problem seems to be similar than can be seen with this applet. It won't seem to learn sine-function unless something "easier" is taught for the weights first (like 1400 cycles of quadratic function). All the other settings in the applet can be left as they initially are. So in the case of sine or cosine it seems that the weights need some boosting to atleast partially right direction before a solution can be found. Why is this?
I'm struggling to see how this could work.
You have, as far as I can see, 1 input, N nodes in 1 layer, then 1 output. So there is no difference between any of the nodes in the hidden layer of the net. Suppose you have an input x, and a set of weights wi. Then the output node y will have the value:
y = Σi w_i x
= x . Σi w_i
So this is always linear.
In order for the nodes to be able to learn differently, they must be wired differently and/or have access to different inputs. So you could supply inputs of the value, the square root of the value (giving some effect of scale), etc and wire different hidden layer nodes to different inputs, and I suspect you'll need at least one more hidden layer anyway.
The neural net is not magic. It produces a set of specific weights for a weighted sum. Since you can derive a set weights to approximate a sine or cosine function, that must inform your idea of what inputs the neural net will need in order to have some chance of succeeding.
An explicit example: the Taylor series of the exponential function is:
exp(x) = 1 + x/1! + x^2/2! + x^3/3! + x^4/4! ...
So if you supplied 6 input notes with 1, x1, x2 etc, then a neural net that just received each input to one corresponding node, and multiplied it by its weight then fed all those outputs to the output node would be capable of the 6-term taylor expansion of the exponential:
in hid out
1 ---- h0 -\
x -- h1 --\
x^2 -- h2 ---\
x^3 -- h3 ----- y
x^4 -- h4 ---/
x^5 -- h5 --/
Not much of a neural net, but you get the point.
Further down the wikipedia page on Taylor series, there are expansions for sin and cos, which are given in terms of odd powers of x and even powers of x respectively (think about it, sin is odd, cos is even, and yes it is that straightforward), so if you supply all the powers of x I would guess that the sin and cos versions will look pretty similar with alternating zero weights. (sin: 0, 1, 0, -1/6..., cos: 1, 0, -1/2...)
I think you can always compute sine and then compute cosine externally. I think your concern here is why the neural net is not learning the cosine function when it can learn the sine function. Assuming that this artifact if not because of your code; I would suggest the following:
It definitely looks like an error in the learning algorithm. Could be because of your starting point. Try starting with weights that gives the correct result for the first input and then march forward.
Check if there is heavy bias in your learning - more +ve than -ve
Since cosine can be computed by sine 90 minus angle, you could find the weights and then recompute the weights in 1 step for cosine.