Why does Gurobi generate slack variables? - linear-programming

I have the following LP:
max  -x4 - x5 - x6 - x7
s.t. x0 + x1 + x4 = 1
     x2 + x3 + x5 = 1
     x0 + x2 + x6 = 1
     x1 + x3 + x7 = 1
Gurobi gives me the basis B=[1,0,2,10]. My model has 8 variables and rank(A)=4, yet the basis contains the variable x10. My question is: why does Gurobi generate slack variables even with rank(A)=4? And how can I get an optimal basis that contains only the original variables (x0 through x7)?

The problem is degenerate. There are multiple optimal bases that give the same primal solution. In other words, some variables are basic at bound. This happens a lot in practice and you should not worry about that.
To make things more complicated: there are also multiple optimal primal solutions; so we say that we have both primal and dual degeneracy.
My question is, why Gurobi generates slack variables even with rank(A)=4?
The LP problem, for solvers like Gurobi (i.e. not a tableau solver), has n structural variables and m logical variables (a.k.a. slacks). The slack variables are implicit, they are not "generated" (in the sense that the matrix A is physically augmented with an identity matrix). Again, this is not something to worry about.
And how to get an optimal base that contains only the original variables (variables from x0 to x7)?
Well, this is an optimal basis. So why would Gurobi spend time and do more pivots to try to make all slacks non-basic? AFAIK no solver would do that. They treat structural and logical variables as equals.
It is not so easy to force variables to be in the basis. A free variable will most likely be in the (optimal) basis (but no 100% guarantee). You can also specify an advanced basis for Gurobi to start from. In the extreme: if this advanced basis is optimal (and feasible) Gurobi will not do any pivots.
I believe this particular problem has 83 optimal (and feasible) bases. Only one of them has all slacks nonbasic. I don't think it is easy to find that solution, even if you had access to a Simplex code and could change it so that (after finding an optimal solution) it continues pivoting slacks out of the basis. I think you would need to enumerate the optimal bases (explicitly or implicitly).
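To see the degeneracy concretely, here is a sketch of the same LP solved with scipy's linprog rather than Gurobi (an assumption of convenience; the point is about the model, not the solver). The optimum is 0 with x4..x7 all at zero, so slack-free primal solutions exist even though the basis Gurobi reports happens to include a slack.

```python
# Sketch: solve the LP from the question with scipy instead of Gurobi.
# linprog minimizes, so we minimize x4 + x5 + x6 + x7 (equivalent to the max).
from scipy.optimize import linprog

c = [0, 0, 0, 0, 1, 1, 1, 1]
A_eq = [
    [1, 1, 0, 0, 1, 0, 0, 0],  # x0 + x1 + x4 = 1
    [0, 0, 1, 1, 0, 1, 0, 0],  # x2 + x3 + x5 = 1
    [1, 0, 1, 0, 0, 0, 1, 0],  # x0 + x2 + x6 = 1
    [0, 1, 0, 1, 0, 0, 0, 1],  # x1 + x3 + x7 = 1
]
b_eq = [1, 1, 1, 1]
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 8)
print(res.fun)  # optimal objective is 0, e.g. x0=1, x3=1, everything else 0
```

Many different bases certify this same optimum, which is exactly the degeneracy discussed above.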

Slack variables are generated because they are needed to solve the dual linear programming model and the reduction bound; a complementary slackness variable Si would also be generated.
Moreover, X0 forms a branch for linear independence of the dual set-up. X0 has the value 4; slack variables are generated from the transformation of the basis, where the rank is given, to the branch X0 for linear independence.
A reduction matrix is formed to find the value of X10, which has the value 5/2.
This helps to eliminate X10 in order to get an optimal basis from the reduction matrix.

Related

Initialize and warmstart parameters on Pyomo

If we initialize a variable model.x at a specific value (e.g. model.x = 1) before solving the model, do we need to pass warmstart=True to Pyomo's solve() method in order to keep those initial values for the optimization?
Keep in mind that an initialized variable is not forced to take the specified value; it only provides the variable with an initial starting value, and the solver will change it if needed.
For now, it depends on the solver interface.
If you are using a solver through the NL-file interface (e.g., AMPL solvers), then initial variable values are always supplied to the solver (if they are not None), and it is up to the solver whether or not it attempts to use those values as a warmstart (e.g., for a MIP) or an initial iterate (e.g., for solvers that use an optimization method that requires a starting point). For solvers that require a starting point, it is also up to the solver what value will be used for any variables that are not supplied a starting point. Often zero is used, but this may vary between solvers.
For all other Pyomo solver interfaces (e.g., LP, MPS, Python), which mainly correspond to MIP solvers, I believe the default behavior is to not supply a warmstart. You have to specify warmstart=True when you call solve to have initial values communicated to the solver.
I do not find this consistent, mainly because when going through the NL-file interface, the solve method does not even accept the warmstart keyword, so you have to use an if-statement when writing general code that works with multiple interfaces.
I think I'll save further discussion for the GitHub issue.

ELKI: Running LOF with varying k

Can I run LOF with varying k through ELKI so that it is easy to compare which k is the best?
Normally you choose a k, and then you can see the ROC AUC, for example. I want to find the best k for the data set, so I need to compare multiple runs. Can I do that in some way easier than manually changing the value of k between runs? For example, I want to compare all k in [1, 100].
Thanks
The Greedy Ensemble shows how to run outlier detection methods for a whole range of k at once efficiently (by only computing the nearest-neighbors once, it will be a lot faster!) using the ComputeKNNOutlierScores application included with ELKI.
The application EvaluatePrecomputedOutlierScores can be used to bulk-evaluate these results with multiple measures.
This is what we used for the publication
G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent and M. E. Houle
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study
Data Mining and Knowledge Discovery 30(4): 891-927, 2016, DOI: 10.1007/s10618-015-0444-8
On the supplementary material website, you can look up the best results for many standard data sets, as well as download the raw results.
But beware that outlier detection quality results tend to be inconclusive. On one data set, one method performs best, on another data set another method. There is no clear winner, because data sets are very diverse.

Query re. how to set up an SVM, which SVM variation … and how to define a metric

I’d like to learn how best to set up an SVM in OpenCV (or another C++ library) for my particular problem (or whether there is a more appropriate algorithm).
My goal is to receive a weighting of how well an input set of labeled points on a 2D plane compares or fits with a set of ‘ideal’ sets of labeled 2D points.
I hope my illustrations make this clear: the first three boxes, labeled A through C, indicate different ideal placements of 3 points; in my illustrations the labelling is managed by colour:
The second graphic gives examples of possible inputs:
If I then pass for instance example input set 1 to the algorithm it will compare that input set with each ideal set, illustrated here:
I would suggest that most observers would agree that the example input 1 is most similar to ideal set A, then B, then C.
My problem is to get not only this ordering out of an algorithm, but ideally also a weighting of how similar the input is to A relative to B and C.
For the example given it might be something like:
A:60%, B:30%, C:10%
Example input 3 might yield something such as:
A:33%, B:32%, C:35% (i.e. different order, and a less 'determined' result)
My end goal is to interpolate between the ideal settings using these weights.
To get the ordering, I'm guessing the 'cost' of fitting the input to each set may simply have been compared anyway (?) ... if so, could this cost be used to find the weighting? Or maybe it is non-linear and some kind of transformation needs to happen? (But relative comparisons were evidently still OK for determining the order.)
Am I on track?
Direct question>> is the openCV SVM appropriate? - or more specifically:
A series of separated binary SVM classifiers for each ideal state and then a final ordering somehow ? (i.e. what is the metric?)
A version of an SVM such as multiclass, structured and so on from another library? (...that I still find hard to conceptually grasp as the examples seem so unrelated)
Also, another critical component I'm not fully grasping yet is how to define what determines a good fit between any example input set and an ideal set. I was thinking Euclidean distance, and I simply sum the distances? What about outliers? My vector calc needs a brush-up, but maybe dot products could nose in there somewhere?
Direct question>> How best to define a metric that describes a fit in this case?
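For concreteness, here is a sketch of the naive metric I have in mind: sum of Euclidean distances between same-labeled points, with a softmax over negative costs as one hypothetical transformation into weights (the names and point values below are placeholders, not real data):

```python
# Sketch: sum-of-distances fit cost between labeled 2D point sets,
# turned into normalized weights via a softmax over negative costs.
import math

def fit_cost(input_pts, ideal_pts):
    """Both arguments map label -> (x, y). Lower cost = better fit."""
    return sum(math.dist(input_pts[k], ideal_pts[k]) for k in ideal_pts)

def weights(input_pts, ideals):
    """ideals maps name -> labeled point set. Returns name -> weight, summing to 1."""
    costs = {name: fit_cost(input_pts, pts) for name, pts in ideals.items()}
    exps = {name: math.exp(-c) for name, c in costs.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

ideal_sets = {
    "A": {"red": (0.0, 0.0), "green": (1.0, 0.0), "blue": (0.0, 1.0)},
    "B": {"red": (5.0, 5.0), "green": (6.0, 5.0), "blue": (5.0, 6.0)},
}
observed = {"red": (0.1, 0.0), "green": (0.9, 0.1), "blue": (0.0, 1.1)}
w = weights(observed, ideal_sets)  # w["A"] should dominate here
```

Whether softmax is the right transformation (versus, say, inverse-cost normalization) is exactly the open question above.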
The real case would have 10~20 points per set and, time permitting, as many 'ideal' sets of points as possible; let's go with 30 for now. Could I expect to get away with ~2 ms per iteration on a reasonable machine (a MacBook Pro), or does this kind of thing blow up?
(disclaimer, I have asked this question more generally on Cross Validated, but there isn't much activity there (?))

How to create a vector containing an (artificially generated) Gaussian (normal) distribution?

Suppose I have data (a daily stock chart is a good example, but it could be anything) in which I only know the range (high - low) within which X units sold, but I don't know the exact price at which any given item sold. Assume for simplicity that the price range contains enough buckets (e.g. forty one-cent increments for a 40-cent range) to make such a distribution practical. How can I go about distributing those items to form a normal bell curve stored in a vector? It doesn't have to be perfect, but it should be realistic.
My (very) naive thinking has been to assume that since random numbers should form a normal distribution, I can do something like have a binary RNG. If, for example, there are forty buckets, then if a '0' comes up 40 times the 0th bucket gets incremented, and if a '1' comes up forty times in a row then the 39th bucket gets incremented. If '1' comes up 20 times, the result lands in the middle of the vector. Do this for each item until X units have been accounted for. This may or may not be right, and in any case it seems way more inefficient than necessary. I am looking for something more sensible.
This isn't homework, just a problem that has been bugging me and my statistics is not up to snuff. Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
I want to write this in c++ so pre-packaged solutions in R or matlab or whatnot are not too useful for me.
Thanks. I hope this made sense.
Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
There's tons of literature on how to create one. The Box–Muller transform, the Marsaglia polar method (a variant of Box-Muller), and the Ziggurat algorithm are three. (Google those terms). Both Box-Muller methods are easy to implement.
Better yet, just use a random generator that already exists that implements one of these algorithms. Both boost and the new C++11 have such packages.
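For illustration, here is a sketch of the Box–Muller transform (in Python for brevity; the ready-made C++11 equivalent is std::normal_distribution in <random>, and Boost has the same):

```python
# Sketch: the basic Box-Muller transform, mapping two uniform(0,1]
# samples to one standard-normal sample.
import math
import random

def box_muller(rng):
    """Return one standard-normal sample."""
    u1 = 1.0 - rng.random()  # shift to (0,1] to avoid log(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

rng = random.Random(42)
samples = [box_muller(rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# mean should be near 0 and var near 1 for a standard normal
```

To fill the price buckets from the question, you would scale and shift each sample to the price range and increment the corresponding vector entry.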
The algorithm that you describe relies on the Central Limit Theorem that says that a random variable defined as the sum of n random variables that belong to the same distribution tends to approach a normal distribution when n grows to infinity. Uniformly distributed pseudorandom variables that come from a computer PRNG make a special case of this general theorem.
To get a more efficient algorithm, you can view the probability density function as some sort of space warp that expands the real axis in the middle and shrinks it at the ends.
Let F: R -> [0,1] be the cumulative distribution function of the normal distribution, let invF be its inverse, and let x be a random variable uniformly distributed on [0,1]; then invF(x) will be a normally distributed random variable.
All you need to implement this is the ability to compute invF(x). Unfortunately this function cannot be expressed in terms of elementary functions. (In fact, it is a solution of a nonlinear differential equation.) However, you can efficiently solve the equation x = F(y) for y using Newton's method.
What I have described is a simplified presentation of the Inverse transform method. It is a very general approach. There are specialized algorithms for sampling from the normal distribution that are more efficient. These are mentioned in the answer of David Hammen.
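A sketch of that Newton iteration (in Python for brevity, using math.erf for the normal CDF; the derivative of F is just the normal pdf):

```python
# Sketch: invert the standard normal CDF by Newton's method,
# solving x = F(y) for y.
import math

def norm_cdf(y):
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def norm_pdf(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def inv_cdf(x, tol=1e-12):
    """Return y such that norm_cdf(y) == x, for 0 < x < 1."""
    y = 0.0                       # start at the median
    for _ in range(100):
        step = (norm_cdf(y) - x) / norm_pdf(y)   # Newton step: (F(y)-x)/F'(y)
        y -= step
        if abs(step) < tol:
            break
    return y
```

With a uniform x in (0,1), inv_cdf(x) is then a standard-normal sample, which is the inverse transform method described above.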

What does the 'lower bound' in circulation problems mean?

Question: Circulation problems allow you to have both a lower and an upper bound on the flow through a particular arc. The upper bound I understand (like pipes, there's only so much stuff that can go through). However, I'm having a difficult time understanding the lower bound idea. What does it mean? Will an algorithm for solving the problem...
try to make sure every arc with a lower bound will get at least that much flow, failing completely if it can't find a way?
simply disregard the arc if the lower bound can't be met? This would make more sense to me, but it would mean there could be arcs with a flow of 0 in the resulting graph.
Context: I'm trying to find a way to quickly schedule a set of events, which each have a length and a set of possible times they can be scheduled at. I'm trying to reduce this problem to a circulation problem, for which efficient algorithms exist.
I put every event in a directed graph as a node, and supply it with the amount of time slots it should fill. Then I add all the possible times as nodes as well, and finally all the time slots, like this (all arcs point to the right):
The first two events have a single possible time and a length of 1, and the last event has a length of 4 and two possible times.
Does this graph make sense? More specifically, will the amount of time slots that get 'filled' be 2 (only the 'easy' ones) or six, like in the picture?
(I'm using a push-relabel algorithm from the LEMON library if that makes any difference.)
Regarding the general circulation problem:
I agree with @Helen; even though it may not be as intuitive to conceive of a practical use of a lower bound, it is a constraint that must be met. I don't believe you would be able to disregard this constraint, even when that flow is zero.
The flow = 0 case appeals to the more intuitive max-flow problem (as pointed out by @KillianDS). In that case, if the flow between a pair of nodes is zero, then they cannot affect the "conservation of flow sum":
When no lower bound is given then (assuming flows are non-negative) a zero flow cannot influence the result, because
It cannot introduce a violation to the constraints
It cannot influence the sum (because it adds a zero term).
A practical example of a minimum flow could exist because of some external constraint (an associated problem requires at least X water to go through a certain pipe, as pointed out by @Helen). Lower-bound constraints could also arise from an equivalent dual problem, which minimizes the flow such that certain edges have a lower bound (and finds an optimum equivalent to a maximization problem with an upper bound).
For your specific problem:
It seems like you're trying to get as many events done in a fixed set of time slots (where no two events can overlap in a time slot).
Consider the sets of time slots that could be assigned to a given event:
E1 -- { 9:10 }
E2 -- { 9:00 }
E3 -- { 9:20, 9:30, 9:40, 9:50 } (one possible assignment)
E3 -- { 9:00, 9:10, 9:20, 9:30 } (the other possible assignment)
So you want to maximize the number of task assignments (i.e. events incident to edges that are turned "on") s.t. the resulting sets are pairwise disjoint (i.e. none of the assigned time slots overlap).
I believe this is NP-Hard because if you could solve this, you could use it to solve the maximal set packing problem (i.e. maximal set packing reduces to this). Your problem can be solved with integer linear programming, but in practice these problems can also be solved very well with greedy methods / branch and bound.
For instance, in your example problem. event E1 "conflicts" with E3 and E2 conflicts with E3. If E1 is assigned (there is only one option), then there is only one remaining possible assignment of E3 (the later assignment). If this assignment is taken for E3, then there is only one remaining assignment for E2. Furthermore, disjoint subgraphs (sets of events that cannot possibly conflict over resources) can be solved separately.
If it were me, I would start with a very simple greedy solution (assign tasks with fewer possible "slots" first), and then use that as the seed for a branch-and-bound solver (if the greedy solution found 4 task assignments, then prune any branch whose subtree of assignments cannot exceed 4). You could even squeeze out some extra performance by creating the graph of pairwise intersections between the sets and only informing the adjacent sets when an assignment is made. You can also update your best number of assignments as the branch and bound continues (I think this is normal), so if you get lucky early, you converge quickly.
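A toy sketch of just the greedy step, on the example event sets above (this is a seed heuristic only, not the full branch-and-bound):

```python
# Sketch: greedy event-to-slot-set assignment, fewest candidate
# slot-sets first, skipping options that overlap used slots.
def greedy_assign(events):
    """events maps name -> list of candidate slot-sets (frozensets)."""
    used = set()
    assignment = {}
    for name, options in sorted(events.items(), key=lambda kv: len(kv[1])):
        for option in options:
            if not (option & used):      # disjoint from prior assignments
                assignment[name] = option
                used |= option
                break
    return assignment

events = {
    "E1": [frozenset({"9:10"})],
    "E2": [frozenset({"9:00"})],
    "E3": [frozenset({"9:20", "9:30", "9:40", "9:50"}),
           frozenset({"9:00", "9:10", "9:20", "9:30"})],
}
result = greedy_assign(events)  # all three events get assigned here
```

On this instance the greedy choice already matches the walkthrough above: E1 and E2 force E3 into its later option.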
I've used this same idea to find the smallest set of proteins that would explain a set of identified peptides (protein pieces), and found it to be more than enough for practical problems. It's a very similar problem.
If you need bleeding edge performance:
Rephrased as an integer linear program, nearly any variant of this problem you'd like can be handled. Of course, in very bad cases it may be slow (in practice, it will probably work for you, especially if your graph is not very densely connected). If it doesn't, regular linear programming relaxations approximate the solution to the ILP and are generally quite good for this sort of problem.
Hope this helps.
The lower bound on the flow of an arc is a hard constraint. If the constraints can't be met, then the algorithm fails. In your case, they definitely can't be met.
Your problem cannot be modeled with a pure network-flow model, even with lower bounds. You are trying to impose the constraint that a flow is either 0 or at least some lower bound. That requires integer variables. However, the LEMON package does have an interface through which you can add integer constraints. The flow out of each of the first layer of arcs must be either 0 or n, where n is the number of required time slots; or you could say that at most one arc out of each "event" has nonzero flow.
Your "disjunction" constraint can be modeled as
f >= y * lower
f <= y * upper
with y restricted to being 0 or 1. If y is 0, then f can only be 0. If y is 1, then f can be any value between lower and upper. The mixed-integer programming algorithms will be orders of magnitude slower than the network-flow algorithms, but they will model your problem.
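A toy check of how the binary y drives the disjunction (not LEMON's API, just the arithmetic of the two constraints):

```python
# Sketch: feasibility of a flow f under f >= y*lower and f <= y*upper
# with binary y. y = 0 forces f = 0; y = 1 allows lower <= f <= upper.
def feasible(f, y, lower, upper):
    return y * lower <= f <= y * upper

assert not feasible(3, 0, 2, 10)   # y = 0 forbids any positive flow
assert feasible(0, 0, 2, 10)       # y = 0 allows only f = 0
assert feasible(5, 1, 2, 10)       # y = 1: anything in [lower, upper]
assert not feasible(1, 1, 2, 10)   # y = 1 still rejects f below lower
```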