Theano: ifelse for three cases - if-statement

I'm trying to implement the neural network model described in this paper. The loss function, however, consists of a 4 part if/else block, similar to this structure:
if correct: loss = 0
elif <condition1>: loss = 0.5
elif <condition2>: loss = 0.2
else: loss = 0.4
I'm aware of the theano.ifelse.ifelse op. However, in order to implement this structure, there would be four nested ifelse cases. Is there an easier way to implement these four cases?
(For the record, I actually implemented a nested ifelse in Theano, but I ran into the same bug as this Google Groups post.)
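One possible way to flatten the chain is elementwise selection instead of lazy branching. The sketch below uses NumPy so that it runs on its own: `np.where` plays the role that `theano.tensor.switch` plays in Theano, and `np.select` expresses the whole chain in one call. The condition arrays and function names are made up for illustration. Unlike `ifelse`, this style evaluates every branch, so it only suits branches that are cheap and safe to compute.

```python
import numpy as np

def loss_nested(correct, cond1, cond2):
    # Same priority order as the if/elif chain: the first matching condition wins.
    return np.where(correct, 0.0,
           np.where(cond1, 0.5,
           np.where(cond2, 0.2, 0.4)))

def loss_select(correct, cond1, cond2):
    # np.select picks the first true condition per element, else the default.
    return np.select([correct, cond1, cond2], [0.0, 0.5, 0.2], default=0.4)

# Hypothetical conditions for four samples
correct = np.array([True, False, False, False])
cond1 = np.array([True, True, False, False])
cond2 = np.array([False, True, True, False])

print(loss_nested(correct, cond1, cond2))
print(loss_select(correct, cond1, cond2))
```

Both functions give the same per-sample loss; the nested form maps most directly onto a chain of switch calls.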

Related

Implementing a constraint based on previous variable's value in GNU Mathprog/AMPL

I have a binary program and one of my variables, x_it, is defined on two sets, I: the set of objects and T: the set of the weeks of the year. Thus x_it is a binary variable standing for whether object i is assigned to something in week t. The constraint I failed to implement in AMPL/GNU MathProg is that if x_it equals 1, then x_i(t+1) and x_i(t+2) should also take the value 1. Is there a way to implement this constraint in a simple mathematical programming language?
The implication you want to implement is:
x(i,t) = 1 ==> x(i,t+1) = 1, x(i,t+2) = 1
AMPL supports implications (with the ==> operator), so we can write this directly. MathProg does not.
A simple way to implement the implication as straightforward linear inequalities is:
x(i,t+1) >= x(i,t)
x(i,t+2) >= x(i,t)
This can easily be expressed in AMPL, MathProg, or any modeling tool.
This is the pure, naive translation of the question. It means, however, that once a single x(i,t)=1, all following x(i,t+1), x(i,t+2), x(i,t+3), ... are 1 as well. That could have been accomplished by just the constraint x(i,t+1) >= x(i,t).
A better interpretation would be: we don't want very short run lengths, i.e., the patterns 010 and 0110 are not allowed. This is sometimes called a minimum up-time in machine scheduling and can be modeled in different ways.
Forbid the patterns 010 and 0110:
(1-x(i,t-1))+x(i,t)+(1-x(i,t+1)) <= 2
(1-x(i,t-1))+x(i,t)+x(i,t+1)+(1-x(i,t+2)) <= 3
The pattern 01 implies 0111:
x(i,t+1)+x(i,t+2) >= 2*(x(i,t)-x(i,t-1))
Both of these approaches prevent the patterns 010 and 0110 from occurring.
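The claim can be checked by brute force on a single four-week window. The sketch below (Python, with made-up function names) enumerates all 16 values of x(i,t-1), x(i,t), x(i,t+1), x(i,t+2) and collects the windows each formulation rejects. It says nothing about the first and last weeks of the horizon, where some of these terms do not exist.

```python
from itertools import product

def ok_forbid(a, b, c, d):
    # (a, b, c, d) = x(i,t-1), x(i,t), x(i,t+1), x(i,t+2)
    no_010 = (1 - a) + b + (1 - c) <= 2
    no_0110 = (1 - a) + b + c + (1 - d) <= 3
    return no_010 and no_0110

def ok_implication(a, b, c, d):
    # The pattern 01 implies 0111
    return c + d >= 2 * (b - a)

windows = list(product((0, 1), repeat=4))
rejected_forbid = {w for w in windows if not ok_forbid(*w)}
rejected_implication = {w for w in windows if not ok_implication(*w)}

print(sorted(rejected_forbid))
print(rejected_forbid == rejected_implication)
```

On one window the two formulations reject the same assignments: exactly those that start a run with 01 and end it within the next two weeks.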

Exploding gradient for gpflow SVGP

When optimizing an SVGP with a Poisson likelihood for a big data set, I see what I think are exploding gradients.
After a few epochs I see a spiky drop of the ELBO, which then very slowly recovers after getting rid of all progress made before.
Roughly 21 iterations correspond to an epoch.
This spike (at least the second one) resulted in a complete shift of the parameters (for vectors of parameters I just plotted the norm to see changes):
How can I deal with that? My first approach would be to clip the gradient, but that seems to require digging around the gpflow code.
My Setup:
Training works via Natural Gradients for the variational parameters and ADAM for the rest, with a slowly (linearly) increasing schedule for the Natural Gradient Gamma.
The batch and inducing point sizes are as large as possible for my setup (both 2^12, with the data set consisting of ~88k samples). I include 1e-5 jitter and initialize the inducing points with k-means.
I use a combined kernel, consisting of a combination of RBF, Matern52, a periodic and a linear kernel on a total of 95 features (a lot of them due to a one-hot encoding), all learnable.
The lengthscales are transformed with gpflow.transforms.
import numpy as np
import gpflow
from gpflow.kernels import Matern52, Periodic, Linear, RBF

# GPflow 1.x API; X, Y, Z, kernel_idxs, MB_SIZE and NAME are defined elsewhere.
with gpflow.defer_build():
    k1 = Matern52(input_dim=len(kernel_idxs["coords"]), active_dims=kernel_idxs["coords"], ARD=False)
    k2 = Periodic(input_dim=len(kernel_idxs["wday"]), active_dims=kernel_idxs["wday"])
    k3 = Linear(input_dim=len(kernel_idxs["onehot"]), active_dims=kernel_idxs["onehot"], ARD=True)
    k4 = RBF(input_dim=len(kernel_idxs["rest"]), active_dims=kernel_idxs["rest"], ARD=True)
    # Exponential transforms for the positive kernel parameters
    k1.lengthscales.transform = gpflow.transforms.Exp()
    k2.lengthscales.transform = gpflow.transforms.Exp()
    k3.variance.transform = gpflow.transforms.Exp()
    k4.lengthscales.transform = gpflow.transforms.Exp()
    m = gpflow.models.SVGP(X, Y, k1 + k2 + k3 + k4, gpflow.likelihoods.Poisson(), Z,
                           mean_function=gpflow.mean_functions.Constant(c=np.ones(1)),
                           minibatch_size=MB_SIZE, name=NAME)
    m.mean_function.set_trainable(False)
m.compile()
UPDATE: Using only ADAM
Following the suggestion by Mark, I switched to ADAM only, which helped me get rid of that sudden explosion. However, I still initialized with an epoch of natgrad only, which seems to save a lot of time.
In addition, the variational parameters seem to change a lot less abruptly (in terms of their norm at least). I guess they'll converge way slower now, but at least it's stable.
Just to add to Mark's answer above, when using nat grads in non-conjugate models it can take a bit of tuning to get the best performance, and instability is potentially a problem. As Mark points out, the large steps that provide potentially faster convergence can also lead to the parameters ending up in bad regions of the parameter space. When the variational approximation is good (i.e. the true and approximate posterior are close) then there is good reason to expect that the nat grad will perform well, but unfortunately there is no silver bullet in the general case. See https://arxiv.org/abs/1903.02984 for some intuition.
This is very interesting. Perhaps trying to not use natgrads is a good idea as well. Clipping gradients indeed seems like a hack that could work. And yes, this would require digging around in the GPflow code a bit. One tip that can help with this is to not use the GPflow optimisers directly. The model._likelihood_tensor contains the TF tensor that should be optimised. Perhaps with some manual TensorFlow magic, you can do the gradient clipping there before running an optimiser.
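To make the clipping idea concrete, here is a minimal NumPy sketch of clipping by global norm. It is a generic illustration with made-up gradient values, not GPflow code; in a TensorFlow graph the same rescaling is available as `tf.clip_by_global_norm`, applied to the gradients before they are handed to the optimiser.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # The scale is 1.0 when the gradients are already within the limit.
    scale = max_norm / max(global_norm, max_norm)
    return [g * scale for g in grads], global_norm

# Hypothetical gradients for two parameter blocks
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Because all blocks share one scale factor, the update direction is preserved and only the step length shrinks.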
In general, I think this sounds like you've stumbled on an actual research problem. Usually these large gradients have a good reason in the model, which can be addressed with careful thought. Is it variance in some Monte Carlo estimate? Is the objective function behaving badly?
As for why not using natural gradients helps: natural gradients use the Fisher matrix as a preconditioner to perform second order optimisation. Doing so can result in quite aggressive moves in parameter space. In certain cases (when there are usable conjugacy relations) these aggressive moves can make optimisation much faster. This case, with the Poisson likelihood, is not one where there are conjugacy relations that will necessarily help optimisation. In fact, the Fisher preconditioner can often be detrimental, particularly when variational parameters are not near the optimum.

Difference LP/MIP and CP

What is the difference between Constraint Programming (CP) and Linear Programming (LP) or Mixed Integer Programming (MIP)? I know what LP and MIP are but don't understand the difference to CP. Or is CP just the same as MIP and LP? I am a bit confused on this...
This may be a little exhaustive, but I will try to provide all the information to cover a good scope of this topic.
I'll start with an example and the corresponding information will make more sense.
**Example**: Say we need to sequence a set of tasks on a machine. Each task i has a specific fixed processing time pi. Each task can be started after its release date ri, and must be completed before its deadline di. Tasks cannot overlap in time. Time is represented as a discrete set of time points, say {1, 2,…, H} (H stands for horizon).
MIP Model:
Variables: Binary variable xij represents whether task i starts at time period j
Constraints:
Each task starts on exactly one time point
∑j xij = 1 for all tasks i
Respect release date and deadline
xij = 0 for all tasks i and all time points j with (j < ri) or (j > di - pi)
Tasks cannot overlap
Variant 1:
∑i xij ≤ 1 for all time points j; we also need to take processing times into account; this becomes messy
Variant 2:
introduce a binary variable bik representing whether task i comes before task k; it must be linked to xij; this becomes messy
MIP models thus consist of linear/quadratic objective functions, linear/quadratic constraints, and binary/integer variables.
CP model:
Variables:
Let starti represent the starting time of task i. It takes a value from the domain {1,2,…, H}, which immediately ensures that each task starts at exactly one time point
Constraints:
Respect release date and deadline
ri ≤ starti ≤ di - pi
Tasks cannot overlap:
for all tasks i and j (i ≠ j): (starti + pi < startj) OR (startj + pj < starti)
and that is it!
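To see how little the CP formulation needs, the two constraint families can be written as a plain feasibility check and solved by enumeration. This is only a sketch: the instance data (H, p, r, d) is hypothetical, brute force stands in for a real CP solver, and the no-overlap disjunction uses the strict inequality from the model, with the second disjunct being the first one with the two tasks swapped.

```python
from itertools import product

def satisfies_cp_model(start, p, r, d):
    """Check the CP model's constraints for one assignment of start times."""
    n = len(start)
    # Respect release date and deadline: r_i <= start_i <= d_i - p_i
    if any(not (r[i] <= start[i] <= d[i] - p[i]) for i in range(n)):
        return False
    # Tasks cannot overlap: for every pair, one task ends before the other starts
    for i in range(n):
        for j in range(i + 1, n):
            if not (start[i] + p[i] < start[j] or start[j] + p[j] < start[i]):
                return False
    return True

# Hypothetical instance: two tasks on a horizon of H = 8 time points
H = 8
p = [2, 3]  # processing times
r = [1, 1]  # release dates
d = [8, 8]  # deadlines

solutions = [s for s in product(range(1, H + 1), repeat=len(p))
             if satisfies_cp_model(s, p, r, d)]
print(len(solutions), solutions[:3])
```

One integer variable per task is enough here, whereas the MIP model above needs one binary variable per task and time point.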
You could probably say that the structure of CP models and MIP models is the same: both use decision variables, an objective function and a set of constraints. Both MIP and CP problems are non-convex and make use of some systematic and exhaustive search algorithms.
However, we see the major difference in modeling capacity. With CP we have n variables and one constraint. In MIP we have nm variables and n+m constraints. This way of mapping global constraints to MIP constraints using binary variables is quite generic.
CP and MIP solve problems in different ways. Both use a divide and conquer approach, where the problem to be solved is recursively split into subproblems by fixing the value of one variable at a time. The main difference lies in what happens at each node of the resulting problem tree. In MIP one usually solves a linear relaxation of the problem and uses the result to guide the search. This is a branch and bound search. In CP, logical inferences based on the combinatorial nature of each global constraint are performed. This is an implicit enumeration search.
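The MIP side of this description can be sketched on a toy 0/1 knapsack problem. The example is an illustration under stated assumptions, not what any particular solver does: it is a maximisation problem, so the relaxation yields an upper bound, and the LP relaxation is solved by the greedy fractional fill (which is optimal for this particular relaxation) rather than by a general LP solver. Weights are assumed to be positive.

```python
def fractional_bound(values, weights, remaining, start, current_value):
    """Upper bound from the LP relaxation of the undecided items.

    Assumes items are sorted by value/weight ratio in descending order.
    """
    bound = current_value
    for i in range(start, len(values)):
        if weights[i] <= remaining:
            remaining -= weights[i]
            bound += values[i]
        else:
            # Take a fraction of the first item that no longer fits
            bound += values[i] * remaining / weights[i]
            break
    return bound

def knapsack_branch_and_bound(values, weights, capacity):
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    best = 0

    def search(i, value, remaining):
        nonlocal best
        best = max(best, value)
        if i == len(v):
            return
        # Prune: even the relaxed subproblem cannot beat the incumbent
        if fractional_bound(v, w, remaining, i, value) <= best:
            return
        if w[i] <= remaining:
            search(i + 1, value + v[i], remaining - w[i])  # branch: take item i
        search(i + 1, value, remaining)                    # branch: skip item i

    search(0, 0, capacity)
    return best

print(knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50))
```

Each node fixes one more variable, and the relaxation decides whether the subtree is worth exploring; a CP solver would instead shrink the domains of the remaining variables by propagation.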
Optimization differences:
A constraint programming engine makes decisions on variables and values and, after each decision, performs a set of logical inferences to reduce the available options for the remaining variables' domains. In contrast, a mathematical programming engine, in the context of discrete optimization, uses a combination of relaxations (strengthened by cutting-planes) and "branch and bound."
A constraint programming engine proves optimality by showing that no better solution than the current one can be found, while a mathematical programming engine uses a lower bound proof provided by cuts and linear relaxation.
A constraint programming engine doesn't make assumptions on the mathematical properties of the solution space (convexity, linearity etc.), while a mathematical programming engine requires that the model falls in a well-defined mathematical category (for instance Mixed Integer Quadratic Programming (MIQP)).
In deciding how you should define your problem, as MIP or CP, the Google Optimization Tools guide suggests:
If all the constraints for the problem must hold for a solution to be feasible (constraints connected by "and" statements), then MIP is generally faster.
If many of the constraints have the property that just one of them needs to hold for a solution to be feasible (constraints connected by "or" statements), then CP is generally faster.
My 2 cents:
There is no single answer as to which approach you should use to formulate your model and solve the problem. CP would probably work better when the number of variables grows very large and the constraints are difficult to formulate using linear equalities. If the MIP relaxation is tight, MIP can give better results. If your lower bound doesn't move enough while traversing your MIP problem, you might want to take higher degrees of MIP or CP into consideration. CP works well when the problem can be represented by global constraints.
Some more reading on MIP and CP:
Mixed-Integer Programming problems have some of the decision variables constrained to integers (-n … 0 … n) at the optimal solution. This makes it easier to define the problems in terms of a mathematical program. MP focuses on a special class of problems and is useful for solving relaxations or subproblems (vertical structure).
Example of a mathematical model:
Objective: minimize c^T x
Constraints: A x = b (linear constraints)
l ≤ x ≤ u (bound constraints)
some or all x_j must take integer values (integrality constraints)
Or the model could be defined by quadratic functions or constraints (MIQP/MIQCP problems):
Objective: minimize x^T Q x + q^T x
Constraints: A x = b (linear constraints)
l ≤ x ≤ u (bound constraints)
x^T Q_i x + q_i^T x ≤ b_i (quadratic constraints)
some or all x_j must take integer values (integrality constraints)
The most common algorithm used to solve MIP problems is the Branch and Bound approach.
CP:
CP stems from problems in AI, Operations Research and Computer Science, and is thus closely affiliated with computer programming. Problems in this area assign symbolic values to variables that need to satisfy certain constraints. These symbolic values have a finite domain and can be labelled with integers. The CP modelling language is more flexible and closer to natural language.
Quoted from one of the IBM docs, Constraint Programming is a technology where:
business problems are modeled using a richer modeling language than what is traditionally found in mathematical optimization
problems are solved with a combination of tree search, artificial intelligence and graph theory techniques
The most common global constraint is the "alldifferent" constraint, which ensures that the decision variables assume some permutation (non-repeating ordering) of integer values. E.g., if the problem has 5 decision variables with domain {1,2,3,4,5}, they can take these values in any non-repeating order.
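The effect of alldifferent on the example with 5 variables can be shown by enumeration. This is a plain Python sketch of the constraint's meaning, not of how a CP solver propagates it; the helper name is made up.

```python
from itertools import product

def all_different(values):
    """True when no value is repeated."""
    return len(set(values)) == len(values)

domain = (1, 2, 3, 4, 5)
# All assignments of 5 variables over the domain that satisfy alldifferent
assignments = [a for a in product(domain, repeat=5) if all_different(a)]

print(len(assignments))  # 5! = 120 permutations out of 5**5 = 3125 assignments
```

A single global constraint thus rules out 3005 of the 3125 candidate assignments; a MIP model would express the same restriction through binary variables and linear constraints.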
The answer to this question depends on whether you see MIP and CP as algorithms, as problems, or as scientific fields of study.
E.g., each MIP problem is clearly a CP problem, as the definition of a MIP problem is to find a(n optimal) solution to a set of linear constraints, while the definition of a CP problem is to find a(n optimal) solution to a set of (non-specified) constraints. On the other hand, many important CP problems can straightforwardly be converted to sets of linear constraints, so seeing CP problems through a MIP perspective makes sense as well.
Algorithmically, CP algorithms historically tend to involve more search branching and complex constraint propagation, while MIP algorithms rely heavily on solving the LP relaxation to a problem. There exist hybrid algorithms though (e.g., SCIP, which literally means "Solving Constraint Integer Programs"), and state-of-the-art solvers often borrow techniques from the other side (e.g., no-good learning and backjumps originated in CP, but are now present in MIP solvers as well).
From a scientific field of study point of view, the difference is purely historical: MIP is part of Operations Research, originating at the end of WWII out of a need to optimize large-scale "operations", while CP grew out of logic programming in the field of Artificial Intelligence to model and solve problems declaratively. But there is a good case to be made that both these fields study the same problem. Note that there even is a big shared conference: CPAIOR.
So all in all, I would say MIP and CP are the same in most respects, except for the main techniques used in typical algorithms for each.
LP and MIP are solved using mathematical programming, while there are specific methods to solve constraint programming problems. The following reference is helpful in understanding the differences:
http://ibmdecisionoptimization.github.io/docplex-doc/mp_vs_cp.html

Comparing variable combinations using contrast or estimate in SAS

So, this should be an easy one, but I've always been garbage at contrasts, and the SAS literature isn't really helping. We are running an analysis, and we need to compare different combinations of variables. For example, we have 8 different breeds and 3 treatments, and want to contrast breed 5 against breed 7 at treatment 1. The code I have written is:
proc mixed data=data;
  class breed treatment field;
  model ear_mass = field breed field*breed treatment field*treatment breed*treatment;
  random field*breed*treatment;
  estimate "1 C0"
    breed 0 0 0 0 1 0 -1 0 breed*treatment 0 0 0 0 1 0 0 0 -1 0 0;
run;
What exactly am I doing wrong in my estimate line that isn't working out?
Your contrast statement for this particular comparison must also include coefficients for breed*field.
When defining contrasts, I recommend starting small and building up. Write a contrast for breed 5 at treatment 1 (B5T1), and check its value against its lsmean to confirm that you've got the right coefficients. Note that you have to average over all field levels to get this estimate. Likewise, write a contrast for B7T1. Then subtract the coefficients for B5T1 from those for B7T1, noting that the coefficients for some terms (e.g., treatment*field) are now all zero.
An easier alternative is to use the LSMESTIMATE statement, which allows you to build contrasts using the lsmeans rather than the model parameters. See the documentation and this paper: Kiernan et al., 2011, CONTRAST and ESTIMATE Statements Made Easy: The LSMESTIMATE Statement.
Alas, you must tell SAS; it can't tell you.
You are right, it is easy to make an error. It is important to know the ordering of factor levels in the interaction, which is determined by the order of factors in the CLASS statement. You can confirm the ordering by looking at the order of the interaction lsmeans in the LSMEANS table.
To check, you can compute the estimate of the contrast by hand using the lsmeans. If it matches, then you can be confident that the standard error, and so the inferential test, are also correct.
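The hand check can be sketched as follows for the breed 5 versus breed 7 comparison at treatment 1. The sketch is in Python rather than SAS and rests on assumptions that should be verified against the LSMEANS table: levels are numbered 1 to 8 and 1 to 3, and with breed listed before treatment in the CLASS statement, treatment varies fastest within breed. The lsmeans values are made up.

```python
import numpy as np

N_BREEDS, N_TREATMENTS = 8, 3

def cell_index(breed, treatment):
    """0-based position of a breed*treatment cell when treatment varies fastest."""
    return (breed - 1) * N_TREATMENTS + (treatment - 1)

def difference_coefficients(cell_a, cell_b):
    """Coefficients over the 24 cell lsmeans for (cell_a - cell_b)."""
    coef = np.zeros(N_BREEDS * N_TREATMENTS)
    coef[cell_index(*cell_a)] = 1.0
    coef[cell_index(*cell_b)] = -1.0
    return coef

# Made-up lsmeans, one per breed*treatment cell, in the same flattened order
rng = np.random.default_rng(0)
lsmeans = rng.normal(loc=100.0, scale=10.0, size=N_BREEDS * N_TREATMENTS)

coef = difference_coefficients((5, 1), (7, 1))
estimate = float(coef @ lsmeans)
print(np.flatnonzero(coef), estimate)
```

Under these assumptions the interaction has 24 cells, and the two nonzero coefficients sit at the 13th and 19th positions, so a coefficient list with fewer entries than that cannot reach breed 7.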
The LSMESTIMATE statement is a really useful tool, faster and much less prone to error than defining contrasts using model parameters.

Unit testing cyclomatically complicated but otherwise trivial calculations

Let's say I have a calculator class whose primary function is to do the following (this code is simplified to make the discussion easier, please don't comment on the style of it):
double pilingCarpetArea = hardstandingsRequireRemediation
    ? hardStandingPerTurbineDimensionA * hardStandingPerTurbineDimensionB * numberOfHardstandings * proportionOfHardstandingsRequiringGroundRemediationWorks
    : 0;
double trackCostMultipler;
if (trackConstructionType == TrackConstructionType.Easy) trackCostMultipler = 0.8;
else if (trackConstructionType == TrackConstructionType.Normal) trackCostMultipler = 1;
else if (trackConstructionType == TrackConstructionType.Hard) trackCostMultipler = 1.3;
else throw new OutOfRangeException("Unknown TrackConstructionType: " + trackConstructionType.ToString());
double PilingCostPerArea = TrackCostPerMeter / referenceTrackWidth * trackCostMultipler;
There are at least 7 routes through this class I should probably test, the combination of trackCostMultiplier and hardstandingsRequireRemediation (6 combinations) and the exception condition. I might also want to add some for divide by zero and overflow and suchlike if I was feeling keen.
So far so good, I can test this number of combinations easily and stylishly. And actually I might trust that multiplication and addition are unlikely to go wrong, and so just have 3 tests for trackCostMultipler and 2 for hardstandingsRequireRemediation, instead of testing all possible combinations.
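The "3 tests plus 2 tests" idea can be sketched as table-driven checks, one row per branch rather than one test per combination. The sketch is in Python rather than C#, and all names (track_cost_multiplier, piling_carpet_area, the case tables) are hypothetical stand-ins for the calculator's pieces.

```python
from enum import Enum

class TrackConstructionType(Enum):
    EASY = "easy"
    NORMAL = "normal"
    HARD = "hard"

_MULTIPLIERS = {
    TrackConstructionType.EASY: 0.8,
    TrackConstructionType.NORMAL: 1.0,
    TrackConstructionType.HARD: 1.3,
}

def track_cost_multiplier(kind):
    try:
        return _MULTIPLIERS[kind]
    except (KeyError, TypeError):
        raise ValueError(f"Unknown TrackConstructionType: {kind!r}") from None

def piling_carpet_area(requires_remediation, dim_a, dim_b, count, proportion):
    return dim_a * dim_b * count * proportion if requires_remediation else 0.0

# One row per branch
MULTIPLIER_CASES = [
    (TrackConstructionType.EASY, 0.8),
    (TrackConstructionType.NORMAL, 1.0),
    (TrackConstructionType.HARD, 1.3),
]
AREA_CASES = [
    ((True, 2.0, 3.0, 4, 0.5), 12.0),
    ((False, 2.0, 3.0, 4, 0.5), 0.0),
]

def run_cases():
    for kind, expected in MULTIPLIER_CASES:
        assert track_cost_multiplier(kind) == expected
    for args, expected in AREA_CASES:
        assert piling_carpet_area(*args) == expected
    try:
        track_cost_multiplier("swamp")  # the exception branch
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an unknown type")
    return len(MULTIPLIER_CASES) + len(AREA_CASES) + 1

print(run_cases())
```

The number of checks grows with the sum of the branch counts (3 + 2 + 1) instead of their product, which is the saving described above; the price is that interactions between branches go untested.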
However, this is a simple case, and the logic in our apps is unfortunately cyclomatically much more complicated than this, so the number of tests could grow huge.
There are some ways to tackle this complexity:
1. Extract the trackCostMultipler calculation to a method in the same class
This is a good thing to do, but it doesn't help me test it unless I make this method public, which is a form of "Test Logic In Production". I often do this in the name of pragmatism, but I would like to avoid it if I can.
2. Defer the trackCostMultipler calculation to a different class
This seems like a good thing to do if the calculation is sufficiently complex, and I can test this new class easily. However I have just made the testing of the original class more complicated, as I will now want to pass in an ITrackCostMultipler "Test Double" of some sort, check that it gets called with the right parameters, and check that its return value is used correctly. When a class has, say, ten sub calculators, its unit / integration test becomes very large and difficult to understand.
I use both (1) and (2), and they give me confidence and they make debugging a lot quicker. However there are definitely downsides, such as Test Logic in Production and Obscure Tests.
I am wondering what others' experiences of testing cyclomatically complicated code are. Is there a way of doing this without the downsides? I realise that Test Specific Subclasses can work around (1), but this seems like a legacy technique to me. It is also possible to manipulate the inputs so that various parts of the calculation return 0 (for addition or subtraction) or 1 (for multiplication or division) to make testing easier, but this only gets me so far.
Thanks
Cedd
Continuing the discussion from the comments to the OP, if you have referentially transparent functions, you can first test each small part by itself, and then combine them and test that the combination is correct.
Since constituent functions are referentially transparent, they are logically interchangeable with their return values. Now the only remaining step would be to prove that the overall function correctly composes the individual functions.
This is a great fit for property-based testing.
As an example, assume that you have two parts of a complex calculation:
module MyCalculations =
    let complexPart1 x y = x + y // Imagine it's more complex
    let complexPart2 x y = x - y // Imagine it's more complex
Both of these functions are deterministic, so assuming that you really want to test a facade function that composes these two functions, you can define this property:
open FsCheck.Xunit
open Swensen.Unquote
open MyCalculations
[<Property>]
let facadeReturnsCorrectResult (x : int) (y : int) =
    let actual = facade x y
    let expected = (x, y) ||> complexPart1 |> complexPart2 x
    expected =! actual
Like other property-based testing frameworks, FsCheck will throw lots of randomly generated values at facadeReturnsCorrectResult (100 times, by default).
Given that both complexPart1 and complexPart2 are deterministic, but you don't know what x and y are, the only way to pass the test is to implement the function correctly:
let facade x y =
    let intermediateResult = complexPart1 x y
    complexPart2 x intermediateResult
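For readers who don't use F#, the same property can be sketched in Python. This is a hand-rolled loop over random inputs standing in for FsCheck, so it has no shrinking of failing cases; a library such as Hypothesis would provide that. The function names mirror the F# example, and the seed and input range are arbitrary choices.

```python
import random

def complex_part1(x, y):
    return x + y  # stand-in for something more complex

def complex_part2(x, y):
    return x - y  # stand-in for something more complex

def facade(x, y):
    intermediate = complex_part1(x, y)
    return complex_part2(x, intermediate)

def check_facade_property(trials=100, seed=0):
    """Compare the facade with the explicit composition on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-10**6, 10**6)
        y = rng.randint(-10**6, 10**6)
        expected = complex_part2(x, complex_part1(x, y))
        assert facade(x, y) == expected, (x, y)
    return trials

print(check_facade_property())
```

The property only states how the parts are wired together, so the parts themselves still need their own tests.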
You need another abstraction level to make your methods simpler, so it will be easier to test them:
doStuff(trackConstructionType, referenceTrackWidth){
    ...
    trackCostMultipler = countTrackCostMultipler(trackConstructionType)
    pilingCostPerArea = countPilingCostPerArea(referenceTrackWidth, trackCostMultipler)
    ...
}
countTrackCostMultipler(trackConstructionType){
    double trackCostMultipler;
    if (trackConstructionType == TrackConstructionType.Easy) trackCostMultipler = 0.8
    else if (trackConstructionType == TrackConstructionType.Normal) trackCostMultipler = 1
    else if (trackConstructionType == TrackConstructionType.Hard) trackCostMultipler = 1.3
    else throw new OutOfRangeException("Unknown TrackConstructionType: " + trackConstructionType.ToString());
    return trackCostMultipler;
}
countPilingCostPerArea(referenceTrackWidth, trackCostMultipler){
    return TrackCostPerMeter / referenceTrackWidth * trackCostMultipler;
}
Sorry for the code, I don't know the language, does not really matter...
If you don't want to make these methods public, then you have to move them to a separate class, and make them public there. The class name could be TrackCostMultiplerAlgorithm or ..Logic or ..Counter, or something like that. That way you will be able to inject the algorithm into the higher abstraction level code if you later have several different algorithms. Everything depends on the actual code.
Oh, and don't worry about the method and class lengths. If you really need a new method or class because the code is too complex, then create one! It does not matter that it will be short. It will always make the code easier to understand as well, because you can write what it does into the method name. The code block inside the method only tells us how it does it...