Question: Circulation problems allow you to have both a lower and an upper bound on the flow through a particular arc. The upper bound I understand (like pipes, there's only so much stuff that can go through). However, I'm having a difficult time understanding the lower bound idea. What does it mean? Will an algorithm for solving the problem...
try to make sure every arc with a lower bound will get at least that much flow, failing completely if it can't find a way?
simply disregard the arc if the lower bound can't be met? This would make more sense to me, but it would mean there could be arcs with a flow of 0 in the resulting graph, i.e. arcs that are effectively unused.
Context: I'm trying to find a way to quickly schedule a set of events, which each have a length and a set of possible times they can be scheduled at. I'm trying to reduce this problem to a circulation problem, for which efficient algorithms exist.
I put every event into a directed graph as a node and supply it with the number of time slots it should fill. Then I add all the possible times as nodes as well, and finally all the time slots, like this (all arcs point to the right):
The first two events have a single possible time and a length of 1, and the last event has a length of 4 and two possible times.
Does this graph make sense? More specifically, will the number of time slots that get 'filled' be 2 (only the 'easy' ones) or 6, as in the picture?
(I'm using a push-relabel algorithm from the LEMON library if that makes any difference.)
Regarding the general circulation problem:
I agree with @Helen; even though it may not be as intuitive to conceive of a practical use for a lower bound, it is a constraint that must be met. I don't believe you would be able to disregard this constraint, even when that flow is zero.
The flow = 0 case appeals to the more intuitive max-flow problem (as pointed out by @KillianDS). In that case, if the flow between a pair of nodes is zero, then they cannot affect the 'conservation of flow' sum:
When no lower bound is given then (assuming flows are non-negative) a zero flow cannot influence the result, because
It cannot introduce a violation to the constraints
It cannot influence the sum (because it adds a zero term).
A practical example of a minimum flow could exist because of some external constraint (an associated problem requires that at least X units of water go through a certain pipe, as pointed out by @Helen). Lower bound constraints could also arise from an equivalent dual problem, which minimizes the flow such that certain edges have a lower bound (and finds an optimum equivalent to that of a maximization problem with an upper bound).
For your specific problem:
It seems like you're trying to fit as many events as possible into a fixed set of time slots (where no two events can overlap in a time slot).
Consider the sets of time slots that could be assigned to a given event:
E1 -- { 9:10 }
E2 -- { 9:00 }
E3 -- { 9:20, 9:30, 9:40, 9:50 }
E3 -- { 9:00, 9:10, 9:20, 9:30 }
So you want to maximize the number of task assignments (i.e. events incident to edges that are turned "on") s.t. the resulting sets are pairwise disjoint (i.e. none of the assigned time slots overlap).
I believe this is NP-Hard because if you could solve this, you could use it to solve the maximal set packing problem (i.e. maximal set packing reduces to this). Your problem can be solved with integer linear programming, but in practice these problems can also be solved very well with greedy methods / branch and bound.
For instance, in your example problem, event E1 'conflicts' with E3 and E2 conflicts with E3. If E1 is assigned (there is only one option), then there is only one remaining possible assignment of E3 (the later one). If that assignment is taken for E3, then there is only one remaining assignment for E2. Furthermore, disjoint subgraphs (sets of events that cannot possibly conflict over resources) can be solved separately.
If it were me, I would start with a very simple greedy solution (assign the events with the fewest possible 'slots' first), and then use that as the seed for a branch-and-bound solver (if the greedy solution found 4 task assignments, then bound: prune any recursive subtree of assignments that cannot exceed 3). You could even squeeze out some extra performance by creating the graph of pairwise intersections between the sets and only informing the adjacent sets when an assignment is made. You can also update your best number of assignments as the branch and bound proceeds (I think this is normal), so if you get lucky early, you converge quickly.
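As a minimal sketch of just the greedy seed (not the branch and bound), assuming each event is given as a list of alternative slot sets; the names and the toy instance below are made up for illustration:

#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

// One event = the list of alternative slot sets it could occupy.
using SlotSet = std::set<int>;
struct Event { const char* name; std::vector<SlotSet> options; };

int main() {
    // Toy instance mirroring the example above: slot 0 = 9:00, 1 = 9:10, 2 = 9:20, ...
    std::vector<Event> events = {
        {"E1", {{1}}},
        {"E2", {{0}}},
        {"E3", {{2, 3, 4, 5}, {0, 1, 2, 3}}},
    };
    // Greedy: handle the events with the fewest alternatives first.
    std::sort(events.begin(), events.end(), [](const Event& a, const Event& b) {
        return a.options.size() < b.options.size();
    });
    SlotSet used;
    int assigned = 0;
    for (const Event& e : events) {
        for (const SlotSet& opt : e.options) {
            bool clash = std::any_of(opt.begin(), opt.end(),
                                     [&](int s) { return used.count(s) > 0; });
            if (!clash) {                         // take the first non-conflicting alternative
                used.insert(opt.begin(), opt.end());
                std::printf("%s assigned\n", e.name);
                ++assigned;
                break;
            }
        }
    }
    std::printf("greedy seed: %d of %zu events assigned\n", assigned, events.size());
    return 0;
}

The count this produces (here 3) is what you would feed into the branch-and-bound phase as the initial incumbent.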
I've used this same idea to find the smallest set of proteins that would explain a set of identified peptides (protein pieces), and found it to be more than enough for practical problems. It's a very similar problem.
If you need bleeding edge performance:
Rephrased as an integer linear program, nearly any variant of this problem that you'd like can be handled. Of course, in very bad cases it may be slow (in practice, it will probably work for you, especially if your graph is not very densely connected). If it doesn't, ordinary linear programming relaxations approximate the solution to the ILP and are generally quite good for this sort of problem.
Hope this helps.
The lower bound on the flow of an arc is a hard constraint. If the constraints can't be met, then the algorithm fails. In your case, they definitely can't be met.
Your problem cannot be modeled with a pure network-flow model, even with lower bounds. You are trying to enforce the constraint that a flow is either 0 or at least some lower bound, and that requires integer variables. However, the LEMON package does have an interface through which you can add integer constraints. The flow on each arc in the first layer must be either 0 or n, where n is the number of required time slots; equivalently, at most one arc out of each 'event' node may carry nonzero flow.
Your 'disjunction' constraint (the flow is either 0 or at least the lower bound) can be modeled as
f >= y * lower
f <= y * upper
with y restricted to being 0 or 1. If y is 0, then f can only be 0. If y is 1, then f can be any value between lower and upper. The mixed-integer programming algorithms will be orders of magnitude slower than the network-flow algorithms, but they will model your problem.
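As a quick plain-C++ illustration of how that pair of inequalities behaves (no solver involved; the numbers below are made up):

#include <cassert>

// True if (f, y) satisfies f >= y*lower and f <= y*upper.
bool feasible(double f, int y, double lower, double upper) {
    return f >= y * lower && f <= y * upper;
}

int main() {
    const double lower = 4.0, upper = 10.0;   // e.g. an event needing 4 slots, capacity 10
    assert(feasible(0.0, 0, lower, upper));   // y = 0 forces f = 0
    assert(!feasible(2.0, 0, lower, upper));  // ... so any nonzero flow is rejected
    assert(!feasible(2.0, 1, lower, upper));  // y = 1 rejects flows below the lower bound
    assert(feasible(4.0, 1, lower, upper));   // ... and accepts anything in [lower, upper]
    return 0;
}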
I'm using C++ to model a (maximization) MIP with CPLEX, and I specify a relative gap using
cplex.setParam(IloCplex::EpGap, gap);
I'm puzzled by the difference between
cplex.getBestObjValue();
and
cplex.getObjValue();
in case of early termination because of the gap.
If I understand correctly, the value from getBestObjValue() will always correspond to an integer feasible solution, and a lower bound to the optimal value. On the other hand, the value from getObjValue() (may? will always?) correspond to a non-feasible solution and is an upper bound to the optimal value. Am I understanding this correctly?
I also have another question: the value returned by getBestObjValue() is, in the case of maximization problems, 'the maximum objective function value of all remaining unexplored nodes' (from the CPLEX docs). Is there a way to query the objective values of these unexplored nodes? I'm asking because I would like to get the minimal value that satisfies my relative gap, not the maximum.
According to the manual:
Cplex.GetBestObjValue Method:
It is computed for a minimization problem as the minimum objective function value of all remaining unexplored nodes. Similarly, it is computed for a maximization problem as the maximum objective function value of all remaining unexplored nodes.
For a regular MIP optimization, this value is also the best known bound on the optimal solution value of the MIP problem. In fact, when a problem has been solved to optimality, this value matches the optimal solution value.
It corresponds to an upper bound (when maximizing) on the objective value; there is a gap when you stop the solver before it reaches optimality. In MIP there is a branch-and-bound tree behind the scenes, and as more nodes are explored, the upper bound decreases. There might or might not be a solution matching the upper bound when you stop via EpGap.
Therefore your assumption below is wrong:
If I understand correctly, the value from getBestObjValue() will always correspond to an integer feasible solution.
getObjValue(), on the other hand, is the objective value of the current best solution (corresponding to a feasible solution that has actually been found). It is a lower bound when maximizing, and it is the value you want to use for your second question.
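A minimal Concert Technology sketch (a toy maximization model, not your model) showing where the two numbers come from; the data is made up:

#include <ilcplex/ilocplex.h>
ILOSTLBEGIN

int main() {
    IloEnv env;
    try {
        // Tiny knapsack-style MIP, just to have something to maximize.
        const int n = 5;
        const double value[n]  = {10, 13, 7, 8, 12};
        const double weight[n] = { 5,  6, 3, 4,  5};
        IloModel model(env);
        IloBoolVarArray x(env, n);
        IloExpr obj(env), cap(env);
        for (int i = 0; i < n; ++i) { obj += value[i] * x[i]; cap += weight[i] * x[i]; }
        model.add(IloMaximize(env, obj));
        model.add(cap <= 12);
        IloCplex cplex(model);
        cplex.setParam(IloCplex::EpGap, 0.05);   // allow early termination at a 5% relative gap
        if (cplex.solve()) {
            // Incumbent: objective of the best feasible solution found (a lower bound when maximizing).
            env.out() << "getObjValue     = " << cplex.getObjValue() << endl;
            // Best bound over the unexplored nodes (an upper bound when maximizing).
            env.out() << "getBestObjValue = " << cplex.getBestObjValue() << endl;
            env.out() << "relative gap    = " << cplex.getMIPRelativeGap() << endl;
        }
        obj.end(); cap.end();
    } catch (IloException& e) {
        env.out() << "Concert exception: " << e << endl;
    }
    env.end();
    return 0;
}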
I am trying to schedule a certain number of events in the week according to certain constraints, and would like to spread out these events as evenly as possible throughout the week.
If I add the standard deviation of the intervals between events to the objective function, then CPLEX can minimise it.
I am struggling to define the standard deviation of the intervals in terms of CPLEX expressions, mainly because the events don't have to be in any particular sequence, and I don't know which event is prior to any other one.
I feel sure this must be a solved problem, but I have not been able to find help in IBM's CPLEX documentation or on the internet.
Scheduling Uniformly Spaced Events
Here are a few possible ideas for you to try:
Let t0, t1, t2, t3 ... tn be the event times. (These are variables chosen by the model.)
Let d1 = t1 - t0, d2 = t2 - t1, ..., dn = tn - t(n-1).
Goal: We want all these d's to be roughly equal, which would have the effect of roughly evenly spacing out the t's.
Options
Option 1: Put a cost on the deviation from ideal
Let us take one example. Let's say that you want to schedule 10 events in a week (168 hours). With no other constraint except equal spacing, we could have the first event start at time t=0 and the last one end at time t=168. The others would be 168/(10-1) =~ 18.7 hours apart. Let's call this d_ideal.
We don't want d's to be much less than d_ideal (18.6) or much greater than d_ideal.
That is, in the objective, add Cost_dev * abs(d_ideal - dj) for each interval dj.
(You have to create two variables, d+ and d-, for each d to handle the absolute values in the objective function.)
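A sketch of the usual linearization, with dplus_j and dminus_j as the two extra nonnegative variables:

d_j - d_ideal = dplus_j - dminus_j,   dplus_j >= 0,   dminus_j >= 0
minimize  ... + Cost_dev * sum_j (dplus_j + dminus_j)

With a positive Cost_dev the solver pushes both variables down as far as the equality allows, so at the optimum their sum equals abs(d_j - d_ideal).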
Option 1a
In the method above, all deviations are priced the same, so the model doesn't care whether it takes one deviation of 3 hours or two deviations of 1.5 hours each. The way to handle that is to add step-wise costs: a small cost for small deviations, and a very high cost for large deviations. (You make them piecewise linear so that the formulation stays an LP/IP.)
Option 2: Max-min
This is along the lines of your idea of minimizing the standard deviation of the d's. We want to maximize each d (increase the inter-event separation), but we also hugely punish (big cost) whichever d value is the greatest. In English: we don't want to let any single d get too large.
This is the MinMax idea (minimize the maximum d value, while also maximizing the individual d's).
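One way to write that down, with D_max as an extra variable and BigCost as a large penalty weight you choose:

d_j <= D_max              for every interval j
maximize  sum_j d_j  -  BigCost * D_max

Maximizing the sum rewards large separations overall, while the D_max term punishes whichever single gap ends up largest; dropping the sum term gives the plain minimize-the-maximum version.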
Option 3: Two LPs: Solve first, then move the events around in a second LP
One drawback of layering on more and more of these side constraints is that the formulation becomes complicated.
To address this, I have seen two (or more) passes used. You solve the base LP first and assign events; then, in another LP, you address the issue of uniformly distributing the times.
The goal of the second LP is to move the events around, without breaking any hard constraints.
Option 3a: Choose one of many "copies"
To achieve this, we use the following idea:
We allow multiple possible time slots for an event, and make the model select one.
The Event e1 (currently assigned to time t1) is copied into (say) 3 other possible slots.
e11 + e12 + e13 + e14 = 1
The second model can choose to move the event to a "better" time slot, or leave it be. (The old solution is always feasible.)
The reason you are not seeing much about this in the CPLEX manuals is that these are all formulation ideas. If you search for job or event scheduling using LPs, you will come across a few PDFs that might be helpful.
I am programming a C++ simulation application in which several mass-spring structures will move and collide and I'm currently struggling with the collision detection and response part. These structures might or might not be closed (it might be a "ball" or just a chain of masses and springs) so it is (I think) impossible to use a "classic" approach where we test for 2 overlapping shapes.
Furthermore, the collisions are a really important part of the simulation and I need them to be as accurate as possible, both in terms of detection and response (real time is not a constraint here). I want to be able to know the forces applied to each node (the masses), as far as possible.
At the moment I am detecting collisions between nodes and springs at each time step, and the detection seems to work. I can compute the time of collision between a node and a spring, and thus find the exact position of the collision. However, I am not sure whether this is the right approach for the problem, and after a lot of research I still can't quite find a way to make things work properly, mainly on the response side of the collisions.
Thus I would really like to hear about any technique, algorithm or library that seems well suited to this kind of collision problem, or any idea you might have to make this work. Really, any kind of help will be greatly appreciated.
If you can meet the following conditions:
0) all collisions are locally binary, that is to say collisions only occur for pairs of particles, not triples etc.,
1) you can predict the future time of a collision between objects i and j from knowledge of their dynamics (assuming that no other collision occurs first), and
2) you know how to process the physics/dynamics of the collision,
then you should be able to do the following:
Let Tpq be the predicted time of a collision between particles p and q, and let Vp (Vq) be a structure holding the local dynamics of particle p (q) (i.e. its velocity, position, spring constants, whatever).
For n particles...
Initialise by calculating all Tpq (p,q in 1..n)
Store the n^2 values of Tpq in a Priority Queue (PQ)
repeat
extract first Tpq from the PQ
Advance the time to Tpq
process the collision (i.e. update Vp and Vq according to your dynamics)
remove all Tpi, and Tiq (i in 1..n) from the PQ
// these will be invalid now as the changes in Vp, Vq means the
// previously calculated collision of p and q with any other particle
// i might occur sooner, later or not at all
recalculate new Tpi and Tiq (i in 1..n) and insert in the PQ
until done
There is an O(n^2) initial setup cost, but each pass through the repeat loop should be O(n log n) - the cost of removing and replacing the 2n-1 invalidated collision entries. This is fairly efficient for moderate numbers of particles (up to hundreds). It has the benefit that you only need to process things at collision times, rather than at equally spaced time steps, which makes it particularly efficient for a sparsely populated simulation.
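A minimal C++ sketch of that loop for 1D point particles with equal-mass elastic collisions (a stand-in for your mass-spring dynamics, not a replacement for it); instead of deleting the invalidated Tpi/Tiq entries it marks them stale with per-particle collision counters, which has the same effect:

#include <cmath>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// 1D point particles of equal mass; an elastic collision just swaps velocities.
struct Particle { double x, v; int hits; };   // hits = number of collisions so far

struct Event {
    double t;          // predicted absolute collision time (Tpq)
    int a, b;          // the pair of particles involved
    int hitsA, hitsB;  // their hit counters when the prediction was made
    bool operator>(const Event& o) const { return t > o.t; }
};
using Queue = std::priority_queue<Event, std::vector<Event>, std::greater<Event>>;

// Predict when particles a and b meet, or +infinity if they never do.
double predict(const std::vector<Particle>& p, int a, int b, double now) {
    double dx = p[b].x - p[a].x, dv = p[b].v - p[a].v;
    if (dv == 0.0) return INFINITY;
    double t = now - dx / dv;                 // positions coincide when dx + dv*(t - now) == 0
    return t > now ? t : INFINITY;
}

// Push fresh predictions of particle a against every other particle.
void schedule(Queue& pq, const std::vector<Particle>& p, int a, double now) {
    for (int b = 0; b < (int)p.size(); ++b) {
        if (b == a) continue;
        double t = predict(p, a, b, now);
        if (std::isfinite(t)) pq.push({t, a, b, p[a].hits, p[b].hits});
    }
}

int main() {
    std::vector<Particle> p = { {0.0, 1.0, 0}, {5.0, 0.0, 0}, {9.0, -1.0, 0} };  // toy initial state
    Queue pq;
    double now = 0.0;
    for (int i = 0; i < (int)p.size(); ++i) schedule(pq, p, i, now);   // O(n^2) initialisation

    for (int processed = 0; !pq.empty() && processed < 10; ) {
        Event e = pq.top(); pq.pop();
        // Stale entry: one of the two particles has collided since this prediction was made.
        if (e.hitsA != p[e.a].hits || e.hitsB != p[e.b].hits) continue;
        for (Particle& q : p) q.x += q.v * (e.t - now);   // advance everyone to the collision time
        now = e.t;
        std::swap(p[e.a].v, p[e.b].v);                    // process the collision (equal masses: swap velocities)
        ++p[e.a].hits; ++p[e.b].hits;                     // lazily invalidates every queued Tpi / Tiq
        std::printf("t = %.3f: particles %d and %d collide\n", now, e.a, e.b);
        schedule(pq, p, e.a, now);                        // recalculate predictions for both participants
        schedule(pq, p, e.b, now);
        ++processed;
    }
    return 0;
}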
I guess an octree approach would work best for your problem. An octree recursively divides the virtual space into the cells of a tree and lets you restrict collision tests to the most probable pairs of nodes.
Here is a short introduction: http://www.flipcode.com/archives/Introduction_To_Octrees.shtml :)
Suppose I have data (a daily stock chart is a good example, but it could be anything) in which I only know the range (high - low) that X units sold within, but I don't know the exact price at which any given item sold. Assume for simplicity that the price range contains enough buckets (e.g. forty one-cent increments for a 40-cent range) to make such a distribution practical. How can I go about distributing those items to form a normal bell curve stored in a vector? It doesn't have to be perfect, just realistic.
My (very) naive thinking has been to assume that, since random numbers should form a normal distribution, I can do something like use a binary RNG. If, for example, there are forty buckets, then if a '0' comes up 40 times the 0th bucket gets incremented, and if a '1' comes up 40 times in a row then the 39th bucket gets incremented. If '1' comes up 20 times then it lands in the middle of the vector. Do this for each item until X units have been accounted for. This may or may not be right, and in any case it seems way more inefficient than necessary. I am looking for something more sensible.
This isn't homework, just a problem that has been bugging me and my statistics is not up to snuff. Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
I want to write this in C++, so pre-packaged solutions in R or MATLAB or whatnot are not too useful for me.
Thanks. I hope this made sense.
Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
There's tons of literature on how to create one. The Box–Muller transform, the Marsaglia polar method (a variant of Box-Muller), and the Ziggurat algorithm are three. (Google those terms). Both Box-Muller methods are easy to implement.
Better yet, just use a random number generator that already exists and implements one of these algorithms. Both Boost and the new C++11 standard library provide such facilities.
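For example, with the C++11 <random> header (the bucket count, unit count, mean and spread below are arbitrary choices to match the 40-cent example in the question):

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int buckets = 40;      // e.g. forty one-cent increments across the day's range
    const int units   = 10000;   // how many items to distribute
    std::mt19937 gen(12345);     // fixed seed so the run is reproducible
    // Centre the bell in the middle of the range; a sigma of range/6 keeps ~99.7% of draws inside it.
    std::normal_distribution<double> price(buckets / 2.0, buckets / 6.0);
    std::vector<int> hist(buckets, 0);
    for (int i = 0; i < units; ++i) {
        int b = (int)std::lround(price(gen));
        if (b < 0) b = 0;
        if (b >= buckets) b = buckets - 1;   // clamp the rare tail draws into the range
        ++hist[b];
    }
    for (int b = 0; b < buckets; ++b) std::printf("bucket %2d: %d\n", b, hist[b]);
    return 0;
}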
The algorithm that you describe relies on the central limit theorem, which says that a random variable defined as the sum of n independent random variables from the same distribution tends to approach a normal distribution as n grows to infinity. Uniformly distributed pseudorandom variables that come from a computer PRNG are a special case of this general theorem.
To get a more efficient algorithm, you can view the probability density function as a sort of space warp that expands the real axis in the middle and shrinks it toward the ends.
Let F : R -> [0,1] be the cumulative distribution function of the normal distribution, let invF be its inverse, and let x be a random variable uniformly distributed on [0,1]; then invF(x) will be a normally distributed random variable.
All you need to implement this is the ability to compute invF(x). Unfortunately this function cannot be expressed in terms of elementary functions; in fact, it is a solution of a nonlinear differential equation. However, you can efficiently solve the equation x = F(y) for y using Newton's method.
What I have described is a simplified presentation of the inverse transform method. It is a very general approach. There are specialized algorithms for sampling from the normal distribution that are more efficient; these are mentioned in David Hammen's answer.
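A sketch of that Newton iteration for the standard normal (the helper names are mine): F(y) can be computed from std::erfc, and its derivative is just the normal density.

#include <cmath>
#include <cstdio>

// Standard normal CDF: F(y) = 0.5 * erfc(-y / sqrt(2)).
double normal_cdf(double y) { return 0.5 * std::erfc(-y / std::sqrt(2.0)); }

// Its derivative, the standard normal density.
double normal_pdf(double y) {
    const double pi = 3.14159265358979323846;
    return std::exp(-0.5 * y * y) / std::sqrt(2.0 * pi);
}

// Solve F(y) = x for y by Newton's method (x must lie strictly between 0 and 1).
double inv_cdf(double x) {
    double y = 0.0;                              // start at the mean
    for (int i = 0; i < 100; ++i) {
        double err = normal_cdf(y) - x;
        if (std::fabs(err) < 1e-12) break;
        y -= err / normal_pdf(y);                // Newton step: y <- y - (F(y) - x) / F'(y)
    }
    return y;
}

int main() {
    for (double x : {0.1, 0.5, 0.841, 0.975})
        std::printf("invF(%.3f) = %+.4f\n", x, inv_cdf(x));
    return 0;
}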
The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question, this is a real world problem that I need to solve - but I've never taken any algorithm courses etc so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each pair of nodes, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)
Because your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval on which to quantize your numbers, and otherwise using the standard BK-tree procedure. Some experimentation may be required; you might want to increase the resolution of the quantization as you progress down the tree, for instance.
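A small sketch of that idea; the metric here is just the distance between plain numbers, standing in for your closeness factor, and the 0.1 quantization width is an arbitrary choice:

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <memory>

double dist(double a, double b) { return std::fabs(a - b); }  // stand-in metric (must obey the triangle inequality)
int bucket(double d) { return (int)std::floor(d / 0.1); }     // quantize a real distance to an integer edge label

struct BKNode {
    double value;
    std::map<int, std::unique_ptr<BKNode>> children;          // subtrees keyed by quantized distance to this node
};

void insert(std::unique_ptr<BKNode>& node, double v) {
    if (!node) { node.reset(new BKNode{v, {}}); return; }
    insert(node->children[bucket(dist(v, node->value))], v);
}

// Nearest-neighbour query: the triangle inequality lets us skip whole subtrees.
void nearest(const BKNode* node, double q, double& bestDist, double& best) {
    if (!node) return;
    double d = dist(q, node->value);
    if (d < bestDist) { bestDist = d; best = node->value; }
    int k = bucket(d);
    for (const auto& kv : node->children) {
        // Everything under the child labelled kv.first sits at that quantized distance from
        // this node, so it can only beat bestDist if the label is close enough to k
        // (the +1 allows for one bucket of quantization slack).
        if (std::abs(kv.first - k) <= bucket(bestDist) + 1)
            nearest(kv.second.get(), q, bestDist, best);
    }
}

int main() {
    std::unique_ptr<BKNode> root;
    for (double v : {0.12, 0.55, 0.31, 0.90, 0.42, 0.77}) insert(root, v);
    double best = -1.0, bestDist = 1e6;    // 1e6 plays the role of "infinity" here
    nearest(root.get(), 0.47, bestDist, best);
    std::printf("closest to 0.47 is %.2f (distance %.2f)\n", best, bestDist);
    return 0;
}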
but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to figure out the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
One addition you might consider - keep in mind we have no idea what data structure you are using for your objects - is to organize all present nodes into a graph, where nodes with factors below a certain threshold can be considered connected, so you can first check nodes that are more likely to be similar/related.
If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, you have to insert it into the sorted list of every other node, and build a sorted list for it as well.
To find the node closest to any given node, just take the first entry on that node's list.
Since you already need O(n^2) space (in order to store all the closeness information, you basically need an NxN matrix where A[i,j] represents the closeness between i and j), you might as well sort it and get O(1) retrieval.
If this closeness forms a linear spectrum (such that closeness to something implies closeness to the other things that are close to it, and not being close implies not being close to those things), then you can simply do a binary or interpolation search on insertion, ordered by closeness, handling one extra complexity: at each point you have to check whether closeness increases or decreases below or above.
For example, if we consider letters - A is close to B but far from Z - then the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert say 'F', you start by comparing with the middle element, [3] G, and the one following that: [4] K. You establish that F is closer to G than K, so the best match is either at G or to the left, and we move halfway into the unexplored region to the left... 3/2=[1] B, followed by E, and we find E's closer to F, so the match is either at E or to its right. Halving the space between our earlier checks at [3] and [1], we test at [2] and find it equally-distant, so insert it in between.
EDIT: it may work better in probabilistic situations, and require fewer comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, then see whether A or the halfway point [3] G is closer). Also, it might be good to finish with a comparison to the closest few points on either side of where the binary/interpolation search led you.
ACM Computing Surveys, September 2001, carried two papers that might be relevant, at least for background: "Searching in Metric Spaces" (lead author Chavez) and "Searching in High Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases" (lead author Bohm). From memory, if all you have is the triangle inequality you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better by using a search structure that knows about this dimensional structure.
Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything <0.5 an attractive force, anything >0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an algorithm it is a heuristic. In the facebook implementation I saw, two people were not able to reach equilibrium and kept dancing around each other. It turns out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer and ~100 nodes. YMMV.
It looks suspiciously like a Nearest Neighbor Search problem (also called a similarity search)