Scheduling - Spread out assigned event times evenly - linear-programming

I am trying to schedule a certain number of events in the week according to certain constraints, and would like to spread out these events as evenly as possible throughout the week.
If I add the standard deviation of the intervals between events to the objective function, then CPLEX can minimise it.
I am struggling to define the standard deviation of the intervals in terms of CPLEX expressions, mainly because the events don't have to be in any particular sequence, and I don't know which event is prior to any other one.
I feel sure this must be a solved problem, but I have not been able to find help in IBM's cplex documentation or on the internet.

Scheduling Uniformly Spaced Events
Here are a few possible ideas for you to try:
Let t0, t1, t2, t3 ... tn be the event times. (These are variables chosen by the model.)
Let d1 = t1-t0, d2 = t2-t1, and so on up to dn = tn-t(n-1).
Goal: We want all these d's to be roughly equal, which would have the effect of roughly evenly spacing out the t's.
Options
Option 1: Put a cost on the deviation from ideal
Let us take one example. Let's say that you want to schedule 10 events in a week (168 hours). With no other constraint except equal spacing, we could have the first event start at time t=0 and the last one end at time t=168. The intervals would then be 168/(10-1) ≈ 18.7 hours. Let's call this d_ideal.
We don't want any d to be much less than d_ideal (about 18.7 hours) or much greater than d_ideal.
That is, in the objective, add Cost_dev * (abs(d_ideal - dj))
(You have to create two variables for each d (d+ and d-) to handle the absolute values in the objective function.)
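A sketch of that linearization, writing d_plus_j and d_minus_j for the d+ and d- above (this is generic LP modelling, not CPLEX-specific syntax):

d_j - d_ideal = d_plus_j - d_minus_j
d_plus_j >= 0,  d_minus_j >= 0
minimize  ... + Cost_dev * sum over j of (d_plus_j + d_minus_j)

Because both deviation variables carry a positive cost, the solver drives at least one of them to zero, so d_plus_j + d_minus_j equals abs(d_ideal - d_j) at the optimum.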
Option 1a
In the method above, all deviations are priced the same, so the model doesn't care whether one interval deviates by 3 hours or two intervals deviate by 1.5 hours each. The way to handle that is to add piecewise costs: a small cost for small deviations and a very high cost for large deviations. (Keep the costs piecewise linear so that the formulation stays an LP/IP.)
Option 2: Max-min
This is close to your idea of minimizing the standard deviation of the d's. We want to maximize each d (increase the inter-event separation), but we also hugely punish (big cost) whichever d value is the greatest. (In English: we don't want to let any single d get too large.)
This is the min-max idea: minimize the maximum d value, but also maximize the individual d's.
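A sketch of one way to write that down (the weight names BigCost and SmallReward are mine; tune them so the first term dominates):

d_max >= d_j                       for every interval j
minimize  BigCost * d_max  -  SmallReward * (d_1 + d_2 + ... + d_n)

The constraints force d_max onto the largest interval, the BigCost term squeezes that largest interval down, and the small reward on the individual d's keeps the other intervals from collapsing to zero.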
Option 3: Two LPs: Solve first, then move the events around in a second LP
One drawback of layering on more and more of these side constraints is that the formulation becomes complicated.
To address this, I have seen two (or more) passes used. You solve the base LP first, assign events and then in another
LP, you address the issue of uniformly distributing times.
The goal of the second LP is to move the events around, without breaking any hard constraints.
Option 3a: Choose one of many "copies"
To achieve this, we use the following idea:
We allow multiple possible time slots for an event, and make the model select one.
Event e1 (currently assigned to time t1) is copied into, say, 3 other possible slots, giving binary choice variables e11, e12, e13, e14 of which exactly one must be selected:
e11 + e12 + e13 + e14 = 1
The second model can choose to move the event to a "better" time slot, or leave it be. (The old solution
is always feasible.)
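A sketch of how the chosen copy can drive the event's time in the second model (T1..T4 are the candidate slot times; the binary restriction on the e1k is part of the sketch):

e11 + e12 + e13 + e14 = 1,   each e1k in {0, 1}
t1 = T1*e11 + T2*e12 + T3*e13 + T4*e14

Since the current slot is one of the candidates, the old assignment stays feasible, and the second model only moves the event when a better-spaced slot doesn't break any hard constraint.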
The reason you are not seeing much in CPLEX manuals is that these are all formulation ideas. If you search for job- or event-scheduling
using LPs, you will come across a few PDFs that might be helpful.

Interval scheduling but with travelling and location

So I have been thinking long and hard about this question, and I think this might be good old-fashioned interval scheduling, done by finding the earliest deadline first.
Here's my approach:
Calculate the deadline for each item (the time it falls off the branch + the time it takes for the item to fall from the branch - the time it takes me to travel there and catch it).
If the deadline is earlier than the current time, I can't catch it.
Among the ones that are reachable, I catch the one with the earliest deadline. Then I recalculate the deadlines for every remaining item, this time feeding in my new position.
However, this approach seems a little inefficient. Can someone point me toward a better one?
After parsing and storing your input, you'll have some array of points in 2 dimensions (x and t) you should try to visit, no more than one x per t. You need to find the optimal path through that array.
One approach would be to brute-force this. You have (2*s)^t possible paths - fewer, in fact, since you shouldn't move beyond 0 - l. So for very small t, you could really find the optimal path - i.e. try every possible path, see how many things you can get to on each path, and return the count from the best path.
However, if t gets big, a brute-force approach becomes infeasible, since the number of paths grows exponentially in t.
You will need to find some approximation to this if you're going to code this up. There are loads of algorithms with different trade-offs, e.g. https://en.wikipedia.org/wiki/Best-first_search .
Priority queues, as found in schedulers, could indeed be useful to this as you rightly state.
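If you do go with the greedy earliest-deadline idea from the question, a rough C++ sketch might look like this (the Item fields, the unit travel speed and the "wait at the spot until it lands" rule are my assumptions about the setup, not given in the question, and as discussed above greedy is not guaranteed to be optimal):

#include <cmath>
#include <cstddef>
#include <vector>

struct Item {
    double x;        // position along the line
    double dropTime; // when it leaves the branch
    double fallTime; // how long it takes to reach the ground
};

// Greedy earliest-deadline-first, as described in the question: repeatedly
// catch the reachable item whose landing time comes first (travel speed = 1).
int greedyCatch(std::vector<Item> items, double pos, double now)
{
    int caught = 0;
    while (!items.empty()) {
        std::size_t best = items.size();
        double bestDeadline = 0.0;
        for (std::size_t i = 0; i < items.size(); ++i) {
            const double deadline = items[i].dropTime + items[i].fallTime;
            const double travel = std::fabs(items[i].x - pos);   // speed = 1
            if (now + travel <= deadline &&                       // reachable in time
                (best == items.size() || deadline < bestDeadline)) {
                best = i;
                bestDeadline = deadline;
            }
        }
        if (best == items.size()) break;   // nothing reachable any more
        pos = items[best].x;               // go there and wait for it to land
        now = bestDeadline;
        ++caught;
        items.erase(items.begin() + static_cast<std::ptrdiff_t>(best));
    }
    return caught;
}

Each pick rescans the remaining items, so the whole loop is O(n^2).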

Fast adding random variables in C++

Short version: how to most efficiently represent and add two random variables given by lists of their realizations?
Mildly longer version:
For a work project, I need to add several random variables, each of which is given by a list of values. For example, the realizations of rand. var. A are {1,2,3} and the realizations of B are {5,6,7}. Hence, what I need is the distribution of A+B, i.e. {1+5,1+6,1+7,2+5,2+6,2+7,3+5,3+6,3+7}. And I need to do this kind of adding several times (let's denote this number of additions as COUNT, where COUNT might reach 720) for different random variables (C, D, ...).
The problem: if I use this stupid algorithm of summing each realization of A with each realization of B, the complexity is exponential in COUNT. Hence, for the case where each r.v. is given by three values, the number of calculations for COUNT=720 is 3^720 ≈ 3.36×10^343, which would take till the end of our days to calculate :) Not to mention that in real life, the length of each r.v. is going to be 5000+.
Solutions:
1/ The first solution is to use the fact that I am OK with rounding, i.e. with having integer values of realizations. This way, I can represent each r.v. as a vector, and at the index corresponding to a realization I store a count of 1 (when the r.v. has this realization once). So for a r.v. A and a vector of realizations indexed from 0 to 10, the vector representing A would be [0,1,1,1,0,0,0,...] and the representation for B would be [0,0,0,0,0,1,1,1,0,0,0]. Now I create A+B by going through these vectors and doing the same thing as above (sum each realization of A with each realization of B and codify the result into the same vector structure, quadratic complexity in the vector length; a code sketch follows after point 2/). The upside of this approach is that the complexity is bounded. The problem is that in real applications, the realizations of A will be in the interval [-50000, 50000] with a granularity of 1. Hence, after adding two random variables, the span of A+B gets to [-100K, 100K], and after 720 additions, the span of SUM(A, B, ...) gets to [-36M, 36M], and even quadratic complexity (compared to exponential complexity) on arrays this large will take forever.
2/ To have shorter arrays, one could possibly use a hashmap, which would most likely reduce the number of operations (array accesses) involved in A+B, the assumption being that some non-trivial portion of the theoretical span [-50K, 50K] will never be a realization. However, as you keep summing more and more random variables, the number of realizations increases exponentially while the span increases only linearly, so the density of numbers in the span increases over time. And this would kill the hashmap's benefits.
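In code, the plain-vector representation from point 1/ might look roughly like this (the Dist struct is my notation, and I've used probability masses rather than raw realization counts, since raw counts overflow after many additions):

#include <cstddef>
#include <vector>

// A random variable as a vector of probability masses over consecutive
// integer values: p[i] is the probability of the value (offset + i).
struct Dist {
    int offset;
    std::vector<double> p;
};

// The distribution of A+B: a discrete convolution of the two mass vectors.
Dist add(const Dist& a, const Dist& b)
{
    Dist r;
    r.offset = a.offset + b.offset;
    r.p.assign(a.p.size() + b.p.size() - 1, 0.0);
    for (std::size_t i = 0; i < a.p.size(); ++i) {
        if (a.p[i] == 0.0) continue;                 // skip empty bins
        for (std::size_t j = 0; j < b.p.size(); ++j)
            r.p[i + j] += a.p[i] * b.p[j];
    }
    return r;
}

Each call is the quadratic pass described in point 1/, so the cost of one addition is proportional to the product of the two spans - which is exactly the blow-up problem once the span reaches [-36M, 36M].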
So the question is: how can I do this problem efficiently? The solution is needed for calculating a VaR in electricity trading where all distributions are given empirically and are like no ordinary distributions, hence formulas are of no use, we can only simulate.
Using math was considered as the first option as half of our dept. are mathematicians. However, the distributions that we're going to add are badly behaved and the COUNT=720 is an extreme. More likely, we are going to use COUNT=24 for a daily VaR. Taking into account the bad behaviour of distributions to add, for COUNT=24 the central limit theorem would not hold too closely (the distro of SUM(A1, A2, ..., A24) would not be close to normal). As we're calculating possible risks, we'd like to get a number as precise as possible.
The intended use is this: you have hourly cashflows from some operation. The distribution of cashflows for one hour is the r.v. A. For the next hour, it's r.v. B, etc. And your question is: what is the largest loss in 99 percent of cases? So you model the cashflows for each of those 24 hours and add these cashflows as random variables so as to get a distribution of the total cashflow over the whole day. Then you take the 0.01 quantile.
Try to reduce the number of passes required to make the whole addition, possibly reducing it to a single pass for every list, including the final one.
I don't think you can cut down on the total number of additions.
In addition, you should look into parallel algorithms and multithreading, if applicable.
At this point, most processors are able to perform additions in parallel, given the proper instructions (SSE), which will make the additions many times faster (still not a cure for the complexity problem).
As you said in your question, you're going to need an awful lot of computation to get the exact answer. So it's not going to happen.
However, as you're dealing with random values, it would be possible to apply some mathematics to the problem. Wouldn't the result of all these additions be something that approaches the normal distribution? For example, consider rolling a single die: each number has equal probability, so the realisations don't follow a normal distribution (actually, they probably do in practice; there was a programme on BBC4 last week showing that lottery balls had a roughly normal distribution in how often they appeared). However, if you roll two dice and sum them, the realisations start to approximate a normal distribution, and the approximation improves as you add more dice. So I think the result of your computation is going to approximate a normal distribution, and it becomes a problem of finding the mean and the sigma for a given set of inputs. You can work out the upper and lower bounds for each input as well as their averages, and I'm sure a bit of Googling will provide methods for applying functions to normal distributions.
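If you do try the normal approximation, the parameters are cheap to get: assuming the hourly r.v.s are independent, the mean of the sum is the sum of the means and the variance of the sum is the sum of the variances. A sketch (the function names are mine):

#include <cstddef>
#include <vector>

// Mean and variance of one empirically given r.v. (a list of equally likely values).
void meanVar(const std::vector<double>& v, double& mean, double& var)
{
    double s = 0.0, s2 = 0.0;
    for (double x : v) { s += x; s2 += x * x; }
    mean = s / v.size();
    var  = s2 / v.size() - mean * mean;
}

// Parameters of the normal approximation to the sum of independent r.v.s:
// means add, and (assuming independence) so do variances.
void sumParams(const std::vector<std::vector<double>>& rvs, double& mu, double& sigma2)
{
    mu = 0.0;
    sigma2 = 0.0;
    for (const std::vector<double>& v : rvs) {
        double m = 0.0, s2 = 0.0;
        meanVar(v, m, s2);
        mu += m;
        sigma2 += s2;
    }
}

The 1% quantile of the approximating normal is then roughly mu - 2.326 * sqrt(sigma2); whether that approximation is good enough for badly behaved hourly distributions is exactly the concern raised in the question.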
I guess there is a corollary question and that is what the results are used for? Knowing how the results are used will inform the decision on how the results are created.
Ignoring the programmatic solutions, you can cut down the total number of additions quite significantly as your data set grows.
If we define four groups W, X, Y and Z, each with three elements, by your own maths this leads to a large number of operations:
W + X => 9 operations
(W + X) + Y => 27 operations
(W + X + Y) + Z => 81 operations
TOTAL: 117 operations
However, if we assume a strictly-ordered definition of your "add" operation so that two sets {a,b} and {c,d} always result in {a+c,a+d,b+c,b+d} then your operation is associative. That means that you can do this:
W + X => 9 operations
Y + Z => 9 operations
(W + X) + (Y + Z) => 81 operations
TOTAL: 99 operations
This is a saving of 18 operations, for a simple case. If you extend the above to 6 groups of 3 members, the total number of operations can be dropped from 1089 to 837 - almost 20% saving. This improvement is more pronounced the more data you have (more sets or more elements will give more savings).
Further, this opens the problem to better parallelisation: if you have 200 groups to process, you can start by combining the 100 pairs in parallel, then the 50 pairs of results, then 25, etc. This allows a large degree of parallelism that should give you much better performance. (For example, 720 sets could be combined in about 10 rounds of parallel additions, since each round halves the number of remaining sets.)
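A sketch of that balanced pairwise combination (the add argument stands for whatever "sum two distributions" routine you use; each round's additions are independent of one another, so they are the natural units to hand to separate threads):

#include <cstddef>
#include <utility>
#include <vector>

// Combine the distributions pairwise in rounds: (d0+d1), (d2+d3), ... until
// one remains. This keeps intermediate results small for longer and makes the
// additions within a round independent. Assumes ds is non-empty.
template <typename D, typename AddFn>
D reduceAll(std::vector<D> ds, AddFn add)
{
    while (ds.size() > 1) {
        std::vector<D> next;
        for (std::size_t i = 0; i + 1 < ds.size(); i += 2)
            next.push_back(add(ds[i], ds[i + 1]));
        if (ds.size() % 2 == 1)
            next.push_back(ds.back());   // odd one out carries to the next round
        ds = std::move(next);
    }
    return ds.front();
}

For 720 inputs this is the ~10 rounds mentioned above.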
I'm absolutely no expert on this, but it would seem an ideal problem for using the parallel processing capability of a typical GPU - my understanding is that something like CUDA would make short work of processing all these calculations in parallel.
EDIT: If your real question is "what's your largest loss" then this is a much easier problem. Given that every value in the ultimate set is the sum of one value from each "component" set, your biggest loss will generally be found by combining the lowest value from each component set. Finding these lower values (one value per set) is a much simpler job, and you then only need sum together that limited set of values.
There are basically two methods: an approximate one and an exact one.
The approximate method models the sum of random variables by a lot of sampling. Basically, having random variables A and B, we draw 50K samples from each r.v., add the sampled values (here SSE can help a lot), and we have an approximate distribution of A+B. This is how mathematicians would do it in Mathematica.
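A sketch of that sampling approach (the 50K sample count follows the text; the fixed seed and the helper name are mine):

#include <cstddef>
#include <random>
#include <vector>

// Approximate the distribution of the sum by sampling: draw one realization
// from each r.v. per sample and add them up.
std::vector<double> sampleSum(const std::vector<std::vector<double>>& rvs,
                              std::size_t nSamples = 50000)
{
    std::mt19937 gen(42);                    // fixed seed for reproducibility
    std::vector<double> sums(nSamples, 0.0);
    for (const std::vector<double>& rv : rvs) {
        std::uniform_int_distribution<std::size_t> pick(0, rv.size() - 1);
        for (std::size_t s = 0; s < nSamples; ++s)
            sums[s] += rv[pick(gen)];        // every listed value is equally likely
    }
    return sums;                             // sort this and read off the 1% point
}

Sorting the returned sums and reading off the value 1% of the way in gives the VaR estimate.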
The exact method utilizes something Dan Puzey proposed, namely summing only a small portion of each r.v.'s density. Let's say we have random variables with the following "densities" (where each value is equally likely, for simplicity's sake):
A = {-5,-3,-2}
B = {+0,+1,+2}
C = {+7,+8,+9}
The sum of A+B+C is going to be
{2,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9}
and if I want to know the whole distribution precisely, I have no choice but to sum each element of A with each element of B, and then each element of this sum with each element of C. However, if I only want the 99% VaR of this sum, i.e. the 1st percentile of this sum, I only have to sum the smallest elements of A, B, C.
More precisely, I will take the nA, nB, nC smallest elements from each distribution. To determine nA, nB, nC, set them all to 1 first. Then increase nA by one if A[nA] = min(A[nA], B[nB], C[nC]) (counting on A, B, C being sorted). This way, I can get the nA, nB, nC smallest elements of A, B, C, which I then have to sum together (each with each other), taking the X-th smallest sum (where X is 1% of the total number of sum combinations, i.e. 1% of 3*3*3 for A, B, C). This also tells me when to stop increasing nA, nB, nC - stop when nA*nB*nC > X.
However, like this I am doing the same redundant work again, i.e. I am calculating the whole distribution of A+B+C to the left of the 1st percentile. Even this will be MUCH shorter than calculating the whole distribution of A+B+C, however. But I believe there should be a simple iterative algorithm that gives the VaR number exactly in O(a*b), where a is the number of added r.v.s and b is the maximum number of elements in the density of each r.v.
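For the three-distribution example, one reading of that tail-only procedure might look roughly like this (a sketch of the idea only - whether the stopping rule always captures the X smallest overall sums is exactly the open question here). It assumes A, B and C are sorted ascending, all values are equally likely, and 1 <= X < A.size()*B.size()*C.size():

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// X-th smallest sum a+b+c, looking only at the nA/nB/nC smallest elements of
// each (sorted) distribution, grown until nA*nB*nC > X.
double tailQuantileSum(const std::vector<double>& A,
                       const std::vector<double>& B,
                       const std::vector<double>& C,
                       std::size_t X)   // e.g. 1% of A.size()*B.size()*C.size()
{
    const double inf = std::numeric_limits<double>::infinity();
    std::size_t nA = 1, nB = 1, nC = 1;
    while (nA * nB * nC <= X) {
        // grow the prefix whose next element is smallest (ties go to A, then B)
        const double a = nA < A.size() ? A[nA] : inf;
        const double b = nB < B.size() ? B[nB] : inf;
        const double c = nC < C.size() ? C[nC] : inf;
        if (a <= b && a <= c)      ++nA;
        else if (b <= c)           ++nB;
        else                       ++nC;
    }
    std::vector<double> sums;
    sums.reserve(nA * nB * nC);
    for (std::size_t i = 0; i < nA; ++i)
        for (std::size_t j = 0; j < nB; ++j)
            for (std::size_t k = 0; k < nC; ++k)
                sums.push_back(A[i] + B[j] + C[k]);
    std::nth_element(sums.begin(), sums.begin() + (X - 1), sums.end());
    return sums[X - 1];                      // the X-th smallest sum
}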
I will be glad for any comments on whether I am correct.

What does the 'lower bound' in circulation problems mean?

Question: Circulation problems allow you to have both a lower and an upper bound on the flow through a particular arc. The upper bound I understand (like pipes, there's only so much stuff that can go through). However, I'm having a difficult time understanding the lower bound idea. What does it mean? Will an algorithm for solving the problem...
try to make sure every arc with a lower bound will get at least that much flow, failing completely if it can't find a way?
simply disregard the arc if the lower bound can't be met? This would make more sense to me, but it would mean there could be arcs with a flow of 0 in the resulting graph.
Context: I'm trying to find a way to quickly schedule a set of events, which each have a length and a set of possible times they can be scheduled at. I'm trying to reduce this problem to a circulation problem, for which efficient algorithms exist.
I put every event in a directed graph as a node, and supply it with the number of time slots it should fill. Then I add all the possible times as nodes as well, and finally all the time slots, like this (all arcs point to the right):
The first two events have a single possible time and a length of 1, and the last event has a length of 4 and two possible times.
Does this graph make sense? More specifically, will the amount of time slots that get 'filled' be 2 (only the 'easy' ones) or six, like in the picture?
(I'm using a push-relabel algorithm from the LEMON library if that makes any difference.)
Regarding the general circulation problem:
I agree with @Helen; even though it may not be as intuitive to conceive of a practical use of a lower bound, it is a constraint that must be met. I don't believe you would be able to disregard this constraint, even when that flow is zero.
The flow = 0 case appeals to the more intuitive max flow problem (as pointed out by @KillianDS). In that case, if the flow between a pair of nodes is zero, then they cannot affect the "conservation of flow sum":
When no lower bound is given then (assuming flows are non-negative) a zero flow cannot influence the result, because
It cannot introduce a violation to the constraints
It cannot influence the sum (because it adds a zero term).
A practical example of a minimum flow could exist because of some external constraint (an associated problem requires at least X water go through a certain pipe, as pointed out by @Helen). Lower bound constraints could also arise from an equivalent dual problem, which minimizes the flow such that certain edges have lower bound (and finds an optimum equivalent to a maximization problem with an upper bound).
For your specific problem:
It seems like you're trying to get as many events done in a fixed set of time slots (where no two events can overlap in a time slot).
Consider the sets of time slots that could be assigned to a given event:
E1 -- { 9:10 }
E2 -- { 9:00 }
E3 -- { 9:20, 9:30, 9:40, 9:50 }
E3 -- { 9:00, 9:10, 9:20, 9:30 }
So you want to maximize the number of task assignments (i.e. events incident to edges that are turned "on") s.t. the resulting sets are pairwise disjoint (i.e. none of the assigned time slots overlap).
I believe this is NP-Hard because if you could solve this, you could use it to solve the maximal set packing problem (i.e. maximal set packing reduces to this). Your problem can be solved with integer linear programming, but in practice these problems can also be solved very well with greedy methods / branch and bound.
For instance, in your example problem, event E1 "conflicts" with E3 and E2 conflicts with E3. If E1 is assigned (there is only one option), then there is only one remaining possible assignment of E3 (the later assignment). If this assignment is taken for E3, then there is only one remaining assignment for E2. Furthermore, disjoint subgraphs (sets of events that cannot possibly conflict over resources) can be solved separately.
If it were me, I would start with a very simple greedy solution (assign the events with the fewest possible "slots" first), and then use that as the seed for a branch-and-bound solver (if the greedy solution found 4 task assignments, prune any recursive subtree of assignments that cannot beat those 4). You could even squeeze out some extra performance by creating the graph of pairwise intersections between the sets and only informing the adjacent sets when an assignment is made. You can also update your best number of assignments as you continue the branch and bound (I think this is normal), so if you get lucky early, you converge quickly.
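A sketch of that greedy seed (the container layout is illustrative; slots are plain integer ids):

#include <algorithm>
#include <set>
#include <vector>

// Greedy seed: handle the events with the fewest candidate slot-sets first,
// taking the first candidate whose slots are all still free.
int greedyAssign(std::vector<std::vector<std::set<int>>> candidates)
{
    // candidates[e] lists the possible slot-sets for event e
    std::sort(candidates.begin(), candidates.end(),
              [](const std::vector<std::set<int>>& a,
                 const std::vector<std::set<int>>& b) { return a.size() < b.size(); });

    std::set<int> used;
    int assigned = 0;
    for (const std::vector<std::set<int>>& options : candidates) {
        for (const std::set<int>& slots : options) {
            const bool allFree = std::none_of(slots.begin(), slots.end(),
                                              [&used](int s) { return used.count(s) > 0; });
            if (allFree) {
                used.insert(slots.begin(), slots.end());
                ++assigned;
                break;                       // this event is placed; next event
            }
        }
    }
    return assigned;
}

The returned count is the incumbent that the branch-and-bound phase then tries to beat.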
I've used this same idea to find the smallest set of proteins that would explain a set of identified peptides (protein pieces), and found it to be more than enough for practical problems. It's a very similar problem.
If you need bleeding edge performance:
When rephrased, integer linear programming can do nearly any variant of this problem that you'd like. Of course, in very bad cases it may be slow (in practice, it's probably going to work for you, especially if your graph is not very densely connected). If it doesn't, regular linear programming relaxations approximate the solution to the ILP and are generally quite good for this sort of problem.
Hope this helps.
The lower bound on the flow of an arc is a hard constraint. If the constraints can't be met, then the algorithm fails. In your case, they definitely can't be met.
Your problem cannot be modeled with a pure network-flow model, even with lower bounds. You are trying to impose the constraint that a flow is either 0 or at least some lower bound. That requires integer variables. However, the LEMON package does have an interface through which you can add integer constraints. The flow out of each of the first-layer arcs must be either 0 or n, where n is the number of required time slots; equivalently, at most one arc out of each "event" has nonzero flow.
Your "disjunction" constraint,
can be modeled as
f >= y * lower
f <= y * upper
with y restricted to being 0 or 1. If y is 0, then f can only be 0. If y is 1, then f can be any value between lower and upper. The mixed-integer programming algorithms will be orders of magnitude slower than the network-flow algorithms, but they will model your problem.

Hard sorting problem - what type of algorithm should I be using?

The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question, this is a real world problem that I need to solve - but I've never taken any algorithm courses etc so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each node, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)
Because your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-Trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval to quantize your number on, and otherwise using the standard Bk-Tree procedure. Some experimentation may be required - you might want to increase the resolution of the quantization as you progress down the tree, for instance.
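A rough sketch of that structure (the bucket count, node layout and pruning slack are my choices; closeness stands for your 0-to-1 metric and is assumed to satisfy the triangle inequality, as stated above):

#include <cstdlib>
#include <map>

// BK-tree over a quantized version of the closeness metric. Children are keyed
// by the quantized distance to their parent, and the triangle inequality lets
// the search skip whole subtrees.
struct BKTree {
    static const int kBuckets = 100;                  // quantization resolution

    struct Node {
        int id;                                       // your node identifier
        std::map<int, Node*> children;                // keyed by quantized distance
    };

    Node* root = nullptr;
    double (*closeness)(int, int) = nullptr;          // plug in your 0..1 metric here

    static int quantize(double d) { return static_cast<int>(d * kBuckets); }

    void insert(int id) {
        if (!root) { root = new Node{id, {}}; return; }
        Node* cur = root;
        for (;;) {
            const int d = quantize(closeness(id, cur->id));
            auto it = cur->children.find(d);
            if (it == cur->children.end()) { cur->children[d] = new Node{id, {}}; return; }
            cur = it->second;
        }
    }

    // Best (smallest-closeness) match for a query node, or -1 if the tree is empty.
    int nearest(int query) const {
        int bestId = -1, bestD = kBuckets + 1;
        search(root, query, bestId, bestD);
        return bestId;
    }

    void search(Node* cur, int query, int& bestId, int& bestD) const {
        if (!cur) return;
        const int d = quantize(closeness(query, cur->id));
        if (d < bestD) { bestD = d; bestId = cur->id; }
        // Triangle inequality (plus one bucket of slack for rounding): only
        // children whose key is within bestD of d can contain a better match.
        for (const auto& kv : cur->children)
            if (std::abs(kv.first - d) <= bestD + 1)
                search(kv.second, query, bestId, bestD);
    }
};

The +1 slack compensates for quantization rounding; tightening or loosening it trades pruning against the risk of missing the true nearest neighbour.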
"but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution"
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to figure out the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
One addition you might consider - keep in mind we have no idea what data structure you are using for your objects - is to organize all present nodes into a graph, where nodes with factors below a certain threshold can be considered connected, so you can first check nodes that are more likely to be similar/related.
If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, it has to be inserted into every other node's sorted list, and every other node has to be added to the new node's list.
To find the node closest to a given node, just take the first entry in that node's list.
Since you already need O(n^2) space (in order to store all the closeness information you need basically an NxN matrix where A[i,j] represents the closeness between i and j) you might as well sort it and get O(1) retrieval.
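A sketch of that bookkeeping (the container choices are mine):

#include <map>
#include <vector>

// For each node, keep its neighbours ordered by closeness (0 = most alike).
struct ClosenessIndex {
    std::vector<std::multimap<double, int>> neighbours;   // neighbours[i]: closeness -> node

    // Insert a new node, given its closeness to every existing node.
    int addNode(const std::vector<double>& closenessToExisting) {
        const int id = static_cast<int>(neighbours.size());
        neighbours.emplace_back();
        for (int other = 0; other < id; ++other) {
            const double c = closenessToExisting[other];
            neighbours[id].insert({c, other});     // the new node's own sorted list
            neighbours[other].insert({c, id});     // and add it to everyone else's
        }
        return id;
    }

    // Closest existing node to `id`: the front of its sorted list, O(1).
    int closestTo(int id) const {
        return neighbours[id].empty() ? -1 : neighbours[id].begin()->second;
    }
};

Insertion is O(n log n) per new node on top of the O(n) closeness evaluations, while lookup of the closest node stays O(1), matching the trade-off described above.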
If this closeness forms a linear spectrum (such that closeness to something implies closeness to other things that are close to it, and not being close implies not being close to those close), then you can simply do a binary or interpolation sort on insertion for closeness, handling one extra complexity: at each point you have to see if closeness increases or decreases below or above.
For example, if we consider letters - A is close to B but far from Z - then the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert say 'F', you start by comparing with the middle element, [3] G, and the one following that: [4] K. You establish that F is closer to G than K, so the best match is either at G or to the left, and we move halfway into the unexplored region to the left... 3/2=[1] B, followed by E, and we find E's closer to F, so the match is either at E or to its right. Halving the space between our earlier checks at [3] and [1], we test at [2] and find it equally-distant, so insert it in between.
EDIT: in probabilistic situations it may work better, and require fewer comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, then see whether A or the halfway point [3] G is closer). Also, it might be good to finish with a comparison to the closest few points on either side of where the binary/interpolation search led you.
ACM Surveys September 2001 carried two papers that might be relevant, at least for background. "Searching in Metric Spaces", lead author Chavez, and "Searching in High Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases", lead author Bohm. From memory, if all you have is the triangle inequality, you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better by using a search structure that knows about this dimensional structure.
Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything <0.5 an attractive force, anything >0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an exact algorithm, it is a heuristic. In the Facebook implementation I saw, two people were not able to reach equilibrium and kept dancing around each other. It turned out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer and ~100 nodes. YMMV.
It looks suspiciously like a Nearest Neighbor Search problem (also called a similarity search)

How to select an unlike number in an array in C++?

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? For example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers of the y-values of the pulse.
The point of this is to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to get a better method for grounding than averaging the first N points in the sample.
Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data. The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things by some fairly minor adjustments in the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?
If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (i.e., you're going to need two passes over the data: one to find the median - which is almost as slow as sorting for floating-point data - and the other to calculate the average), but it requires less overhead at the time of working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes = 256 possible values), you could use 256 histogram 'bins' with a single pass over your data, counting the values that go into each bin; then it's really easy to find the median, approximate the mean, remove outliers, etc. This would be my preferred option, if you could afford to lose some of the precision in your data, followed by maintaining a sorted list if that is appropriate for your data.
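A sketch of the histogram idea for byte-valued data (the trim fraction is arbitrary; call it with, say, trimFrac = 0.05):

#include <algorithm>
#include <cstddef>
#include <vector>

// One pass to fill 256 bins, then average only the central mass, dropping a
// fraction of the lowest and highest samples as outliers.
double trimmedMeanBytes(const std::vector<unsigned char>& data, double trimFrac)
{
    std::size_t bins[256] = {0};
    for (unsigned char v : data) ++bins[v];

    const std::size_t n  = data.size();
    const std::size_t lo = static_cast<std::size_t>(trimFrac * n);   // ranks dropped at the bottom
    const std::size_t hi = n - lo;                                    // first rank dropped at the top

    double sum = 0.0;
    std::size_t kept = 0, seen = 0;
    for (int v = 0; v < 256; ++v) {
        // this bin covers ranks [seen, seen + bins[v]) in sorted order
        const std::size_t from = std::max(seen, lo);
        const std::size_t to   = std::min(seen + bins[v], hi);
        if (to > from) {
            sum  += static_cast<double>(v) * static_cast<double>(to - from);
            kept += to - from;
        }
        seen += bins[v];
    }
    return kept ? sum / static_cast<double>(kept) : 0.0;
}

The same bins also give you the median: walk up until the cumulative count passes n/2.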
A quick way might be to take the median, and then take the average of the numbers not too far off from the median.
"Not too far off" being dependent on your project.
A good rule of thumb for determining likely outliers is to calculate the interquartile range (IQR), and then treat any value more than 1.5*IQR beyond the nearest quartile (i.e. below Q1 - 1.5*IQR or above Q3 + 1.5*IQR) as an outlier.
This is the basic method many statistics systems (like R) use to automatically detect outliers.
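A sketch of that rule (quartiles here are simple rank positions in a sorted copy, which is close enough for 2000 samples; it assumes a non-empty array):

#include <algorithm>
#include <cstddef>
#include <vector>

// Average only the values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
double iqrFilteredMean(std::vector<double> v)      // taken by value: we sort a copy
{
    std::sort(v.begin(), v.end());
    const double q1  = v[v.size() / 4];
    const double q3  = v[(3 * v.size()) / 4];
    const double iqr = q3 - q1;
    const double lo  = q1 - 1.5 * iqr;
    const double hi  = q3 + 1.5 * iqr;

    double sum = 0.0;
    std::size_t count = 0;
    for (double x : v)
        if (x >= lo && x <= hi) { sum += x; ++count; }
    return count ? sum / static_cast<double>(count) : 0.0;
}

The sort dominates at O(n log n) per call, which should still be comfortably fast for 2000 entries per iteration.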
Any method that is statistically sound and a good way to approach this (Dark Eru, Daniel White) will be too computationally intense to repeat that many times, and I think I've found a workaround that will allow a later correction (meaning, leave it un-grounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.
Here's a quick and dirty method that I've used before (works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier)
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple minutes.
#include <cmath>
#include <cstddef>
#include <vector>

// Running average that ignores points deviating more than percentDeviation
// from the current average (a runnable C++ version of the pseudocode).
double robustAverage(const std::vector<double>& arr)
{
    double avgX = arr[0];                 // initialize the average with the first point
    double sumX = arr[0];
    std::size_t count = 1;
    const double percentDeviation = 0.3;  // percent deviation acceptable for non-outliers

    for (std::size_t i = 1; i < arr.size(); ++i) {
        const double x = arr[i];
        // fabs keeps the acceptance band valid even when the average is negative
        if (x < avgX + std::fabs(avgX) * percentDeviation &&
            x > avgX - std::fabs(avgX) * percentDeviation) {
            ++count;
            sumX += x;
            avgX = sumX / count;
        }
    }
    return avgX;
}