Exploding gradient for gpflow SVGP

When optimizing an SVGP with a Poisson likelihood on a large data set, I see what I think are exploding gradients.
After a few epochs I see a sudden, spike-like drop of the ELBO, which then recovers only very slowly, wiping out all the progress made before.
Roughly 21 iterations correspond to an epoch.
This spike (at least the second one) resulted in a complete shift of the parameters (for vectors of parameters I just plotted the norm to see the changes):
How can I deal with that? My first approach would be to clip the gradient, but that seems to require digging around in the gpflow code.
My Setup:
Training works via natural gradients for the variational parameters and Adam for the rest, with a slowly (linearly) increasing schedule for the natural-gradient gamma.
The batch and inducing-point sizes are as large as possible for my setup
(both 2^12, with the data set consisting of ~88k samples). I include 1e-5 jitter and initialize the inducing points with k-means.
I use a combined kernel consisting of RBF, Matern52, a periodic and a linear kernel on a total of 95 features (many of them due to a one-hot encoding), all learnable.
The lengthscales are transformed with gpflow.transforms.
import numpy as np
import gpflow
from gpflow.kernels import Matern52, Periodic, Linear, RBF

with gpflow.defer_build():
    k1 = Matern52(input_dim=len(kernel_idxs["coords"]), active_dims=kernel_idxs["coords"], ARD=False)
    k2 = Periodic(input_dim=len(kernel_idxs["wday"]), active_dims=kernel_idxs["wday"])
    k3 = Linear(input_dim=len(kernel_idxs["onehot"]), active_dims=kernel_idxs["onehot"], ARD=True)
    k4 = RBF(input_dim=len(kernel_idxs["rest"]), active_dims=kernel_idxs["rest"], ARD=True)

    # use an Exp transform for the positive parameters instead of the default
    k1.lengthscales.transform = gpflow.transforms.Exp()
    k2.lengthscales.transform = gpflow.transforms.Exp()
    k3.variance.transform = gpflow.transforms.Exp()
    k4.lengthscales.transform = gpflow.transforms.Exp()

    m = gpflow.models.SVGP(X, Y, k1 + k2 + k3 + k4, gpflow.likelihoods.Poisson(), Z,
                           mean_function=gpflow.mean_functions.Constant(c=np.ones(1)),
                           minibatch_size=MB_SIZE, name=NAME)
    m.mean_function.set_trainable(False)
m.compile()
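For reference, the optimisation loop itself looks roughly like this (a sketch in GPflow 1.x style; the learning rate, gamma and iteration count are placeholders, the linearly increasing gamma schedule is omitted, and module paths like gpflow.train follow GPflow 1.x):

# Adam should not touch the variational parameters; the natural-gradient step handles them
m.q_mu.set_trainable(False)
m.q_sqrt.set_trainable(False)

adam_op = gpflow.train.AdamOptimizer(0.001).make_optimize_tensor(m)
natgrad_op = gpflow.train.NatGradOptimizer(gamma=0.01).make_optimize_tensor(
    m, var_list=[(m.q_mu, m.q_sqrt)])

session = m.enquire_session()
for step in range(10000):
    session.run(natgrad_op)   # natural-gradient step on q(u)
    session.run(adam_op)      # Adam step on the remaining (hyper)parameters
m.anchor(session)             # write the trained values back into the gpflow parameters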
UPDATE: Using only Adam
Following the suggestion by Mark, I switched to Adam only, which got rid of that sudden explosion. However, I still initialize with one epoch of natural gradients only, which seems to save a lot of time.
In addition, the variational parameters now change much less abruptly (in terms of their norm, at least). I guess they will converge much more slowly now, but at least it is stable.

Just to add to Mark's answer above: when using natural gradients in non-conjugate models, it can take a bit of tuning to get the best performance, and instability is potentially a problem. As Mark points out, the large steps that potentially give faster convergence can also lead to the parameters ending up in bad regions of the parameter space. When the variational approximation is good (i.e. the true and approximate posterior are close), then there is good reason to expect that the natural gradient will perform well, but unfortunately there is no silver bullet in the general case. See https://arxiv.org/abs/1903.02984 for some intuition.

This is very interesting. Perhaps trying not to use natgrads is a good idea as well. Clipping gradients indeed seems like a hack that could work. And yes, this would require digging around in the GPflow code a bit. One tip that can help here is to not use the GPflow optimisers directly. The model._likelihood_tensor contains the TF tensor that should be optimised. Perhaps with some manual TensorFlow magic, you can do the gradient clipping here before running an optimiser.
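A rough sketch of that idea, assuming GPflow 1.x on TensorFlow 1.x and an already compiled model m (attribute names such as objective, trainable_tensors and enquire_session follow GPflow 1.x; the clip norm, learning rate and iteration count are arbitrary):

import tensorflow as tf

session = m.enquire_session()
objective = m.objective                      # the tensor GPflow's own optimisers minimise
opt = tf.train.AdamOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(objective, var_list=m.trainable_tensors)
clipped = [(tf.clip_by_norm(g, 10.0), v)     # clip each gradient's norm before applying it
           for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(clipped)
session.run(tf.variables_initializer(opt.variables()))

for _ in range(1000):
    session.run(train_op)
m.anchor(session)                            # write the trained values back into the gpflow parameters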
In general, I think this sounds like you've stumbled on an actual research problem. Usually these large gradients have a good reason in the model, which can be addressed with careful thought. Is it variance in some monte carlo estimate? Is the objective function behaving badly?
Regarding why not using natural gradients helps: natural gradients use the Fisher matrix as a preconditioner to perform second-order optimisation. Doing so can result in quite aggressive moves in parameter space. In certain cases (when there are usable conjugacy relations) these aggressive moves can make optimisation much faster. This case, with the Poisson likelihood, is not one where there are conjugacy relations that will necessarily help optimisation. In fact, the Fisher preconditioner can often be detrimental, particularly when the variational parameters are not near the optimum.
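For concreteness, the contrast being described is between the plain gradient step and the Fisher-preconditioned (natural-gradient) step on the variational parameters \theta of q:

\theta_{t+1} = \theta_t - \gamma\,\nabla_\theta \mathcal{L}(\theta_t)
\qquad\text{vs.}\qquad
\theta_{t+1} = \theta_t - \gamma\,F(\theta_t)^{-1}\,\nabla_\theta \mathcal{L}(\theta_t),

where F is the Fisher information matrix of the variational distribution. Far from the optimum, F can be poorly conditioned, so the F^{-1} factor can produce very large steps, which is consistent with the sudden parameter shifts shown above.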

Related

DOcplex: Solving an LP problem partially at a time

I am working on a linear programming problem with 800K constraints. The full problem takes 20 minutes to solve, but if I solve it for only half the horizon it takes just 1 minute. Is there a way in DOcplex to solve for a partial horizon and then use that solution to solve the other half of the problem, without using a for-loop?
Three suggestions:
Load your problem as LP or SAV into the CPLEX interactive optimizer and run "display problem stats". This might show (or rule out) precision issues (an ill-conditioned problem). It will also report the number of nonzeros.
Set the datacheck parameter to 2; this might detect numerical issues in the data.
Have you tried different LP algorithms? Using the lpmethod parameter you could try the primal, dual or barrier algorithm to see whether one runs faster on your problem.
Reference:
https://www.ibm.com/support/knowledgecenter/SSSA5P_12.10.0/ilog.odms.cplex.help/CPLEX/Parameters/topics/LPMETHOD.html
In DOcplex:
model.parameters.read.datacheck = 2
model.parameters.lpmethod = 4  # 4 = barrier
From your answers, I can think of the following:
If you are in pure LP (is this true?), I see no point in rounding numbers (but yes, that would help in a MIP; try rounding coefficients whose fractional part is, say, less than 1e-7: 4.0000001 -> 4).
A condition number of 1e+14 indicates a serious modeling issue: a common source is mixing different objectives via weighting coefficients. Have you tried multi-objective to avoid that?
Another source is big-M formulations, for which you should prefer indicator constraints. If you are in neither of these two cases, then try to renormalize the data to keep the condition number in a smaller range.
Finally, you might try setting the Markowitz tolerance to 0.99 to add extra caution in simplex factorizations, but behavior may vary from one data set to another (see the sketch below).
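For concreteness, the Markowitz-tolerance and indicator-constraint suggestions look roughly like this in DOcplex (a sketch with made-up variables x, y and b; the parameter paths follow the CPLEX parameter hierarchy):

from docplex.mp.model import Model

mdl = Model(name="sketch")
x = mdl.continuous_var(name="x")
y = mdl.continuous_var(name="y")
b = mdl.binary_var(name="b")

# extra caution in simplex factorizations, as suggested above
mdl.parameters.simplex.tolerances.markowitz = 0.99

# indicator constraint instead of a big-M construction: "if b == 1 then x + y <= 10"
mdl.add_indicator(b, x + y <= 10)

mdl.minimize(x + 2 * y)
sol = mdl.solve(log_output=True)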

Fast approximation of square_root(x) with known 0<=x<=1

How can I approximate sqrt(x) for a 32-bit float x if I know that x<=1?
There must be some tricks to exploit the range x<=1.
The result doesn't have to be precise; the maximum error may be <0.001.
(0.001 is just a magic number, you can change it.)
I don't mind which language, but I prefer C++, on the CPU (not GPU).
I think an explicit formula is better than a table look-up.
It is useful for particles in my 3D game (VS 2015 + Ogre3D + Bullet), and I can't find any clue about it.
I suspect the reason for the downvote storm is that it looks like an assignment / interview question?
... or that the solution is already well known?
Square roots are ordinarily computed with FSQRT, which keeps getting faster. It is a single instruction that you practically can't beat. The fast inverse square root, for example, won't help unless you can use its result directly: if you have to invert it again, that FDIV alone will take roughly as much time as FSQRT would have.
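One way to use the fast inverse square root's result directly is sqrt(x) = x * rsqrt(x), which avoids the FDIV entirely. Purely as an illustration of the error you can expect on [0, 1], here is a sketch in Python (a real implementation would be a few lines of C++ with the same bit trick):

import struct
import math

def fast_rsqrt(x):
    # reinterpret the float32 bits as an integer, apply the magic constant,
    # then do one Newton-Raphson refinement step for 1/sqrt(x)
    i = struct.unpack('I', struct.pack('f', x))[0]
    i = 0x5f3759df - (i >> 1)
    y = struct.unpack('f', struct.pack('I', i))[0]
    return y * (1.5 - 0.5 * x * y * y)

def fast_sqrt(x):
    return x * fast_rsqrt(x) if x > 0.0 else 0.0

# brute-force check of the maximum absolute error on (0, 1]
worst = max(abs(fast_sqrt(k / 100000.0) - math.sqrt(k / 100000.0))
            for k in range(1, 100001))
print(worst)   # roughly 2e-3 with one refinement step; a second step gets well below 0.001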

What causes shadow acne?

I have been reading up on shadow mapping, and found the following tutorial:
http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-16-shadow-mapping/
It makes sense to me up until the point where the author starts discussing the "shadow acne" artifact. They explain the cause with the following diagram (with no words):
I am still having a lot of trouble understanding what actually causes shadow acne and why adding a bias fixes it.
It seems that the resolution of the shadow map has no effect on acne. What is it then? Maybe float precision, or is it something else?
Yes, it is a precision issue. Not really a float problem, just finite precision.
In theory the shadow map stores "distance to closest object from light". But in practice it stores "distance±eps from light".
Then when testing, you have your fragments distance to the same light. But again, in practice ± eps2. So if you compare those two values it turns out that eps varies differently when interpolating for shadow map rendering or shading. So if you compare d ± eps < d2 ± eps2, if d2==d, you might get the wrong result because eps!=eps2. But if you compare d ± eps < d2 + max(eps) + max(eps2) ± eps2 you will be fine.
In this example d2==d. That is called self shadowing. And can be easily fixed with the above bias, or by simply not testing against yourself in raytracing.
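A tiny numeric toy of that comparison (all numbers made up; step stands in for the depth-buffer precision, and both depths come from the same surface, so the correct answer is "not in shadow"):

# the depth stored in the shadow map is quantized; the depth computed while shading is not
step = 1.0 / 1024.0                       # pretend depth-buffer precision

def quantize(d):
    return round(d / step) * step

light_depth = 0.4375211                   # depth of the surface as seen from the light
frag_depth  = 0.4375204                   # same surface, re-derived while shading

stored = quantize(light_depth)
in_shadow_naive  = frag_depth > stored                 # True: spurious self-shadowing (acne)
bias = 2.0 * step                                      # larger than the worst-case quantization error
in_shadow_biased = frag_depth > stored + bias          # False: the bias removes the acne
print(in_shadow_naive, in_shadow_biased)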
It gets much more tricky with different objects and when eps and eps2 are vastly different. One way to deal with it is to control eps (http://developer.download.nvidia.com/SDK/10.5/opengl/src/cascaded_shadow_maps/doc/cascaded_shadow_maps.pdf). Or one can just take a lot more samples.
To try to answer the question directly: the core issue is that shadow mapping compares distances as if they were exact, but the stored distances are quantized. Quantized values are usually fine, but here we are comparing values computed in two different spaces.

Finding an optimal solution to a system of linear equations in c++

Here's the problem:
I am currently trying to create a control system which is required to find a solution to a series of complex linear equations without a unique solution.
My problem arises because there will only ever be six equations, while there may be upwards of 20 unknowns (usually far more than six). Of course, such an underdetermined system will not yield a unique solution through standard Gaussian elimination or by reducing the matrix to reduced row echelon form.
However, I think I may be able to narrow things down and get a more useful solution, because I know that each of the unknowns cannot have a value smaller than zero or greater than one, but is free to take on any value in between.
Of course, I am trying to write code that finds a correct solution, but if there are multiple combinations that yield satisfactory results, I want the one that minimizes the sum over all unknowns of (value of unknown * efficiency constant), i.e. Sigma[x_i * e_i] for i = 0 to n; however, finding an accurate solution has higher priority.
Performance is also important, since this algorithm may need to run several times per second.
So, does anyone have any ideas to help me on implementing this?
Edit: You might just want to stick to linear programming with equality and inequality constraints, but here's an interesting exact solution that does not incorporate the constraint that your unknowns are between 0 and 1.
Here's a powerpoint discussing your problem: http://see.stanford.edu/materials/lsoeldsee263/08-min-norm.pdf
I'll translate your problem into math to make things a bit easier to figure out:
You have a 6x20 matrix A and a vector x with 20 elements. You want to minimize (x^T)e subject to Ax=y. According to the slides, if you were just minimizing the norm of x, the answer would be A^T(AA^T)^(-1)y. I'll take another look at this as soon as I get the chance and see what the solution is for minimizing (x^T)e (i.e. your specific problem).
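For what it's worth, here is a quick numpy check of that minimum-norm formula (a sketch with made-up data of the stated dimensions):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 20))   # 6 equations, 20 unknowns
y = rng.random(6)

x_min_norm = A.T @ np.linalg.solve(A @ A.T, y)   # A^T (A A^T)^(-1) y
print(np.allclose(A @ x_min_norm, y))            # the equations are satisfied...
# ...and among all solutions of A x = y this one has the smallest 2-norm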
Edit: I looked in the powerpoint some more and near the end there's a slide entitled "General norm minimization with equality constraints". I am going to switch the notation to match the slide's:
Your problem is that you want to minimize ||Ax-b||, where b = 0 and A is your e vector and x is the 20 unknowns. This is subject to Cx=d. Apparently the answer is:
x=(A^T A)^-1 (A^T b -C^T(C(A^T A)^-1 C^T)^-1 (C(A^T A)^-1 A^Tb - d))
It's not pretty, but it's not as bad as you might think. There really aren't that many calculations. For example, (A^T A)^-1 only needs to be calculated once, and then you can reuse the answer. And your matrices aren't that big.
Note that I didn't incorporate the constraint that the elements of x are within [0,1].
It looks like the solution for what I am doing is linear programming. It is starting to come back to me, but if I have other problems I will post them in their own dedicated questions instead of turning this one into an encyclopedia.
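For reference, with the box constraints included the whole thing is a small linear program; here is a minimal sketch with SciPy (made-up data; the C++ route would use an LP library instead, but the formulation is identical):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A = rng.random((6, 20))          # 6 equations, 20 unknowns
y = A @ rng.random(20)           # a right-hand side that is feasible with x in [0, 1]
e = rng.random(20)               # efficiency constants

# minimize e.x  subject to  A x = y  and  0 <= x_i <= 1
res = linprog(c=e, A_eq=A, b_eq=y, bounds=[(0.0, 1.0)] * 20, method="highs")
print(res.status, res.x)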

Weighted linear least square for 2D data point sets

My question is an extension of the discussion How to fit the 2D scatter data with a line with C++. Now I want to extend the question further: when estimating the line that fits 2D scatter data, it would be better if we could treat each 2D scatter point differently. That is to say, if a scatter point is far away from the line, we could give it a low weight, and vice versa. The question then becomes: given an array of 2D scatter points as well as their weighting factors, how can we estimate the line that passes through them? A good implementation of this method can be found in this article (weighted least squares regression). However, the implementation of the algorithm in that article is too complicated, as it involves matrix calculations. I am therefore trying to find a method without matrix calculations. The algorithm is an extension of simple linear regression; to illustrate it, I wrote the following MATLAB code:
function line = weighted_least_squre_for_line(x, y, weighting)
    % Weighted least squares fit of a line to the points (x, y) with weights `weighting`.
    % Slope (beta) from the weighted normal equations:
    part1 = sum(weighting.*x.*y) * sum(weighting(:));
    part2 = sum(weighting.*x) * sum(weighting.*y);
    part3 = sum(x.^2.*weighting) * sum(weighting(:));
    part4 = sum(weighting.*x).^2;
    beta  = (part1 - part2) / (part3 - part4);
    % Intercept (alpha):
    alpha = (sum(weighting.*y) - beta*sum(weighting.*x)) / sum(weighting);
    % Return the line as [a b c], i.e. a*x + b*y + c = 0:
    a = beta;
    c = alpha;
    b = -1;
    line = [a b c];
end
In the above code, x, y and weighting represent the x-coordinates, y-coordinates and the weighting factors respectively. I have tested the algorithm with several examples but am still not sure whether it is right, as it gives a different result from polyfit, which relies on matrix calculation. I am posting the implementation here for your advice. Do you think it is a correct implementation? Thanks!
If you think it is a good idea to downweight points that are far from the line, you might be attracted by least absolute deviations (http://en.wikipedia.org/wiki/Least_absolute_deviations), because one way of calculating it is via iteratively re-weighted least squares (http://en.wikipedia.org/wiki/Iteratively_re-weighted_least_squares), which gives less weight to points far from the line.
If you think all your points are "good data", then it would be a mistake to weight them naively according to their distance from your initial fit. However, it's a fairly common practice to discard "outliers": if a few data points are implausibly far from the fit, and you have reason to believe that there's an error mechanism that could generate a small subset of "bad" datapoints, you could simply remove the implausible points from the dataset to get a better fit.
As far as the math is concerned, I would recommend biting the bullet and trying to figure out the matrix math. Perhaps you could find a different article, or a book which has a better presentation. I will not comment on your Matlab code, except to say that it looks like you will have some precision problems when subtracting part4 from part3, and probably part2 from part1 as well.
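For what it's worth, the closed form is easy to cross-check numerically; here is a sketch in numpy rather than MATLAB (made-up data): scaling both sides by sqrt(w) and solving with ordinary least squares minimizes the same weighted objective, so the two answers should match.

import numpy as np

rng = np.random.default_rng(1)
x = rng.random(50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(50)
w = rng.random(50)

# closed-form weighted least squares (same algebra as the MATLAB code above)
sw = w.sum()
beta = (sw * (w * x * y).sum() - (w * x).sum() * (w * y).sum()) / \
       (sw * (w * x * x).sum() - (w * x).sum() ** 2)
alpha = ((w * y).sum() - beta * (w * x).sum()) / sw

# reference: minimize sum w_i * (y_i - (alpha + beta * x_i))^2 via lstsq on scaled data
Aw = np.sqrt(w)[:, None] * np.column_stack([np.ones_like(x), x])
bw = np.sqrt(w) * y
alpha_ref, beta_ref = np.linalg.lstsq(Aw, bw, rcond=None)[0]
print(np.allclose([alpha, beta], [alpha_ref, beta_ref]))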
Not exactly what you are asking for, but you should look into robust regression. MATLAB has the function robustfit (requires Statistics Toolbox).
There is even an interactive demo you can play with to compare regular linear regression vs. robust regression:
>> robustdemo
This shows that in the presence of outliers, robust regression tends to give better results.