I have encountered several optimization problems that involve identifying one or more indices in a vector that maximize or minimize a cost. Is there a way to identify such indices in linear programming? I'm open to solutions in MathProg, CVXR, CVXPY, or any other API.
For example, identifying an index is needed in change-point problems (find the index at which the function changes) and for putting distance constraints on the traveling salesman problem (visit city X before cumulative distance Y).
As a simple example, suppose we want to identify the location in a vector where the sums on either side are most nearly equal (their difference is smallest). In this example, the solution is index 5:
x = c(1, 3, 6, 4, 7, 9, 6, 2, 3)
Attempt 1
Using CVXR, I tried declaring split_index and using that as an index (e.g., x[1:split_index]):
library(CVXR)
split_index = Variable(1, integer = TRUE)
objective = Minimize(abs(sum(x[1:split_index]) - sum(x[(split_index+1):length(x)])))
result = solve(objective)
It errors in 1:split_index with NA/NaN argument.
Attempt 2
Declare an explicit index vector (indices) and do an elementwise logical test of whether split_index <= indices. Then element-wise multiply that binary vector with x to select one or the other side of the split:
indices = seq_along(x)
split_index = Variable(1, integer = TRUE)
is_first = split_index <= indices
objective = Minimize(abs(sum(x * is_first) - sum(x * !is_first)))
result = solve(objective)
It errors in x * is_first with non-numeric argument to binary operator. I suspect this error arises because is_first is now an IneqConstraint object.
The model uses binary decision variables delta[i] and auxiliary variables y[1], y[2]; x is a constant vector:

min  |y[1] - y[2]|
s.t. delta[i-1] <= delta[i]            for i = 2..n
     y[1] = sum_i (1 - delta[i]) * x[i]
     y[2] = sum_i delta[i] * x[i]
     delta[i] in {0, 1}
R code:
> library(Rglpk)
> library(CVXR)
>
> x <- c(1, 3, 6, 4, 7, 9, 6, 2, 3)
> n <- length(x)
> delta <- Variable(n, boolean=T)
> y <- Variable(2)
> order <- list()
> for (i in 2:n) {
+ order[[as.character(i)]] <- delta[i-1] <= delta[i]
+ }
>
>
> problem <- Problem(Minimize(abs(y[1]-y[2])),
+ c(order,
+ y[1] == t(1-delta) %*% x,
+ y[2] == t(delta) %*%x))
> result <- solve(problem,solver = "GLPK", verbose=T)
GLPK Simplex Optimizer, v4.47
30 rows, 12 columns, 60 non-zeros
0: obj = 0.000000000e+000 infeas = 4.100e+001 (2)
* 7: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
* 8: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
30 rows, 12 columns, 60 non-zeros
9 integer variables, none of which are binary
Integer optimization begins...
+ 8: mip = not found yet >= -inf (1; 0)
+ 9: >>>>> 1.000000000e+000 >= 0.000000000e+000 100.0% (2; 0)
+ 9: mip = 1.000000000e+000 >= tree is empty 0.0% (0; 3)
INTEGER OPTIMAL SOLUTION FOUND
> result$getValue(delta)
[,1]
[1,] 0
[2,] 0
[3,] 0
[4,] 0
[5,] 0
[6,] 1
[7,] 1
[8,] 1
[9,] 1
> result$getValue(y)
[,1]
[1,] 21
[2,] 20
>
The absolute value is automatically linearized by CVXR.
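(A standard way to do this: minimizing |y[1] - y[2]| is replaced by minimizing an auxiliary variable t subject to -t <= y[1] - y[2] <= t.)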
At the end of the day, if you are selecting things by index, I think you need to work this with a set of corresponding binary selection variables. The fact that you are selecting "things in a row" as in your example problem is just something that needs to be handled with constraints on the binary variables.
To solve the problem you posed, I made a set of binary selection variables, call it s[i] where i = {0, 1, 2, ..., len(x)-1}, and then constrained:
s[i] <= s[i-1] for i = {1, 2, ..., len(x)-1}
which enforces "continuity": the selection runs from the start up to the first non-selected element, and nothing is selected after that.
My solution is in Python. LMK if you'd like me to post. The concept above, I think, is what you are asking about.
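For concreteness, here is a minimal CVXPY sketch of that idea (my illustration, not the answerer's posted code; it assumes a mixed-integer-capable solver is installed):

import cvxpy as cp
import numpy as np

x = np.array([1, 3, 6, 4, 7, 9, 6, 2, 3])
n = len(x)

s = cp.Variable(n, boolean=True)                    # s[i] = 1 while we are in the left part
monotone = [s[i] <= s[i - 1] for i in range(1, n)]  # once s drops to 0 it stays 0

left = x @ s                                        # sum of the selected (left) elements
right = x @ (1 - s)                                 # sum of the rest
prob = cp.Problem(cp.Minimize(cp.abs(left - right)), monotone)
prob.solve()                                        # needs a MIP solver, e.g. GLPK_MI
print(s.value)                                      # e.g. [1 1 1 1 1 0 0 0 0]: split after the 5th element

The monotonicity constraints are exactly the "continuity" constraints described above, and CVXPY linearizes the absolute value just as CVXR does.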
Related
I am given this algorithmic problem, and need to find a way to return the count of entries in a list S and in another list L that lie between some variable x and some variable y, inclusive, with a query that runs in O(1) time:
I've issued a challenge against Jack. He will submit a list of his favorite years (from 0 to 2020). If Jack really likes a year,
he may list it multiple times. Since Jack comes up with this list on the fly, it is in no
particular order. Specifically, the list is not sorted, nor do years that appear in the list
multiple times appear next to each other in the list.
I will also submit such a list of years.
I then will ask Jack to pick a random year between 0 and 2020. Suppose Jack picks the year x.
At the same time, I will also then pick a random year between 0 and 2020. Suppose I
pick the year y. Without loss of generality, suppose that x ≤ y.
Once x and y are picked, Jack and I get a very short amount of time (perhaps 5
seconds) to decide if we want to re-do the process of selecting x and y.
If no one asks for a re-do, then we count the number of entries in Jack's list that are
between x and y inclusively and the number of entries in my list that are between x and
y inclusively.
More technically, here is the situation. You are given lists S and L of m and n integers,
respectively, in the range [0, k], representing the collections of years selected by Jack and
me. You may preprocess S and L in O(m+n+k) time. You must then give an algorithm
that runs in O(1) time – so that I can decide if I need to ask for a re-do – that solves the
following problem:
Input: Two integers, x as a member of [0,k] and y as a member of [0,k]
Output: the number of entries in S in the range [x, y], and the number of entries in L in [x, y].
For example, suppose S = {3, 1, 9, 2, 2, 3, 4}. Given x = 2 and y = 3, the returned count
would be 4.
I would prefer pseudocode; it helps me understand the problem more easily.
Implementing the approach of user3386109, taking care of the edge case x = 0.
user3386109 : Make a histogram, and then compute the accumulated sum for each entry in the histogram. Suppose S={3,1,9,2,2,3,4} and k is 9. The histogram is H={0,1,2,2,1,0,0,0,0,1}. After accumulating, H={0,1,3,5,6,6,6,6,6,7}. Given x=2 and y=3, the count is H[y] - H[x-1] = H[3] - H[1] = 5 - 1 = 4. Of course, x=0 is a corner case that has to be handled.
# INPUT
S = [3, 1, 9, 2, 2, 3, 4]
L = [2, 9, 4, 6, 8, 5, 3]
k = 9
x = 2
y = 3

# Histogram for S
S_hist = [0] * (k + 1)
for element in S:
    S_hist[element] = S_hist[element] + 1

# Storing prefix sums in S_hist ("total" avoids shadowing the built-in sum)
total = S_hist[0]
for index in range(1, k + 1):
    total = total + S_hist[index]
    S_hist[index] = total

# Similar approach for L
# Histogram for L
L_hist = [0] * (k + 1)
for element in L:
    L_hist[element] = L_hist[element] + 1

# Storing prefix sums in L_hist
total = L_hist[0]
for index in range(1, k + 1):
    total = total + L_hist[index]
    L_hist[index] = total

# Finding number of elements between x and y (inclusive) in S
print("number of elements between x and y (inclusive) in S:")
if x == 0:
    print(S_hist[y])
else:
    print(S_hist[y] - S_hist[x - 1])

# Finding number of elements between x and y (inclusive) in L
print("number of elements between x and y (inclusive) in L:")
if x == 0:
    print(L_hist[y])
else:
    print(L_hist[y] - L_hist[x - 1])
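The same logic can be factored into a reusable helper; here is a short sketch (the helper names are my own) that makes the O(m + n + k) preprocessing and the O(1) query explicit:

def build_prefix_hist(values, k):
    # O(len(values) + k): histogram, then in-place prefix sums
    hist = [0] * (k + 1)
    for v in values:
        hist[v] += 1
    for i in range(1, k + 1):
        hist[i] += hist[i - 1]
    return hist

def count_in_range(prefix_hist, x, y):
    # O(1) query: number of entries in [x, y]; x == 0 is the corner case
    return prefix_hist[y] - (prefix_hist[x - 1] if x > 0 else 0)

S_prefix = build_prefix_hist([3, 1, 9, 2, 2, 3, 4], 9)
print(count_in_range(S_prefix, 2, 3))   # 4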
I am a beginner in RStudio, so hopefully someone can help me with this problem. The case: I want to make an if-else loop. I made the following code for an l times m matrix:
for (i in 1:l) {
  for (j in 1:m) {
    if (is.na(quantilereturns[i,j]) < quantile(quantilereturns[,j], c(.1), na.rm=TRUE)) {
      quantilereturns[i,j]
    } else { (0) }
  }
}
Summary: I want to make a matrix that keeps the values that are smaller than the 10% quantile of the corresponding column of the matrix quantilereturns. So when a value is smaller than the 10% quantile, it keeps its original value; otherwise it becomes zero.
The code doesn't give any errors, but it doesn't change the values in the matrix either.
Can someone help me?
You need to assign the result to a cell of the matrix (and compare the cell's value itself, not is.na(...), against the quantile; the quantiles are computed up front so that zeroing cells does not shift them). I will take the matrix from another recent thread as an example:
a <- c(4, -9, 2)
b <- c(-1, 3, -8)
c <- c(5, 2, 6)
d <- c(7, 9, -2)
matrix <- cbind(a,b,c,d)
d <- dim(matrix)
rows <- d[1]
columns <- d[2]
print("Before")
print(matrix)
q <- apply(matrix, 2, quantile, probs = 0.1, na.rm = TRUE)  # per-column 10% quantiles
for (i in 1:rows) {
  for (j in 1:columns) {
    if (!is.na(matrix[i,j]) && matrix[i,j] >= q[j]) {
      matrix[i,j] <- 0
    }
  }
}
print("After")
print(matrix)
this gives

[1] "Before"
      a  b c  d
[1,]  4 -1 5  7
[2,] -9  3 2  9
[3,]  2 -8 6 -2
[1] "After"
      a  b c  d
[1,]  0  0 0  0
[2,] -9  0 2  0
[3,]  0 -8 0 -2
So the essential line you are looking for is matrix[i,j] <- 0
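For comparison, the same column-wise masking needs no explicit loops in Python/NumPy; here is a minimal sketch of the idea (NumPy's default quantile interpolation matches R's type-7 quantile):

import numpy as np

mat = np.array([[ 4.0, -1.0, 5.0,  7.0],
                [-9.0,  3.0, 2.0,  9.0],
                [ 2.0, -8.0, 6.0, -2.0]])
q = np.nanquantile(mat, 0.1, axis=0)   # 10% quantile of each column, ignoring NAs
mat[mat >= q] = 0                      # zero out everything at or above the quantile
print(mat)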
I saw a question on CareerCup, but I did not get the answer I wanted there. I wrote an answer myself and would like your comments on my analysis of the time complexity and on the algorithm and code. Or you could provide an algorithm with better time complexity.
You are given d > 0 fair dice with n > 0 "sides"; write a function that returns a histogram of the frequencies of the results of the dice rolls.
For example, for 2 dice, each with 3 sides, the results are:
(1, 1) -> 2
(1, 2) -> 3
(1, 3) -> 4
(2, 1) -> 3
(2, 2) -> 4
(2, 3) -> 5
(3, 1) -> 4
(3, 2) -> 5
(3, 3) -> 6
And the function should return:
2: 1
3: 2
4: 3
5: 2
6: 1
(My solution.) The time complexity of a brute-force depth-first search is O(n^d). However, you can use a DP idea to solve this problem. For example, take d=3 and n=3: you can use the result for d==1 when computing d==2 (and then the result for d==2 when computing d==3):
d==1
num #
1 1
2 1
3 1
d==2
first roll second roll is 1
num # num #
1 1 2 1
2 1 -> 3 1
3 1 4 1
first roll second roll is 2
num # num #
1 1 3 1
2 1 -> 4 1
3 1 5 1
first roll second roll is 3
num # num #
1 1 4 1
2 1 -> 5 1
3 1 6 1
Therefore, combining the three cases gives the distribution after the second roll:
num #
2 1
3 2
4 3
5 2
6 1
The time complexity of this DP algorithm is

SUM_{i=2..d} n * [(n-1)(i-1) + 1] ~ O(n^2 * d^2)

where (n-1)(i-1)+1 is the number of distinct sums after i-1 rolls; combining each of them with the n faces of the next die yields the new range of sums (e.g., for d=2 and n=3, the sums range over 2..6).
The code, written in C++, is as follows:
#include <vector>
#include <utility>
using namespace std;

vector<pair<int,long long>> diceHisto(int numSide, int numDice) {
    int n = numSide*numDice;
    vector<long long> cur(n+1,0), nxt(n+1,0);
    for(int i=1; i<=numSide; i++) cur[i]=1;          // distribution for a single die
    for(int i=2; i<=numDice; i++) {
        int start = i-1, end = (i-1)*numSide;        // range of previous sum of rolls
        for(int j=1; j<=numSide; j++) {
            for(int k=start; k<=end; k++)
                nxt[k+j] += cur[k];                  // add one more die showing j
        }
        swap(cur,nxt);
        for(int j=start; j<=end; j++) nxt[j]=0;      // clear the scratch array
    }
    vector<pair<int,long long>> result;
    for(int i=numDice; i<=numSide*numDice; i++)
        result.push_back({i,cur[i]});
    return result;
}
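As a quick sanity check, diceHisto(3, 2) returns the pairs (2,1), (3,2), (4,3), (5,2), (6,1), matching the table in the question.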
You can do it in O(n*d^2). First, note that the generating function for an n-sided die is p(n) = x+x^2+x^3+...+x^n, and that the distribution for d throws has generating function p(n)^d. Representing the polynomials as arrays, you need O(nd) coefficients, and multiplying by p(n) can be done in a single pass in O(nd) time by keeping a rolling sum.
Here's some Python code that implements this. It has one non-obvious optimisation: it throws out a factor of x from each p(n) (or equivalently, it treats the dice as having faces 0,1,2,...,n-1 rather than 1,2,3,...,n), which is why d is added back in when showing the distribution.
def dice(n, d):
    # coefficients of (1 + x + ... + x^(n-1))^d, one rolling-sum pass per die
    r = [1] + [0] * (n-1) * d
    nr = [0] * len(r)
    for k in range(d):
        t = 0
        for i in range(len(r)):
            t += r[i]
            if i >= n:
                t -= r[i-n]        # keep a window over the last n coefficients
            nr[i] = t
        r, nr = nr, r
    return r

def show_dist(n, d):
    for i, k in enumerate(dice(n, d)):
        if k:
            print(i + d, k)        # add d back: faces were shifted to 0..n-1

show_dist(6, 3)
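For reference, show_dist(6, 3) prints the familiar 3d6 distribution: sums 3 through 18 with counts 1, 3, 6, 10, 15, 21, 25, 27, 27, 25, 21, 15, 10, 6, 3, 1 (216 outcomes in total).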
The time and space complexity are easy to see: there are nested loops with d and (n-1)*d + 1 iterations, so the time complexity is O(n*d^2); and there are two arrays of size O(nd) and no other allocation, so the space complexity is O(nd).
Just in case, here is a simple example in Python using the OpenTURNS platform.
import openturns as ot
d = 2 # number of dice
n = 6 # number of sides per die
# possible values of one die
dice_distribution = ot.UserDefined([[i] for i in range(1, n + 1)])
# distribution of the sum of d dice
sum_distribution = sum([dice_distribution] * d)
That's it!
print(sum_distribution)
will show you all the possible values and their corresponding probabilities:
>>> UserDefined(
{x = [2], p = 0.0277778},
{x = [3], p = 0.0555556},
{x = [4], p = 0.0833333},
{x = [5], p = 0.111111},
{x = [6], p = 0.138889},
{x = [7], p = 0.166667},
{x = [8], p = 0.138889},
{x = [9], p = 0.111111},
{x = [10], p = 0.0833333},
{x = [11], p = 0.0555556},
{x = [12], p = 0.0277778}
)
You can also draw the probability distribution function:
sum_distribution.drawPDF()
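drawPDF() returns a graph object; one way to display it (assuming matplotlib is available) is through the OpenTURNS viewer:

from openturns.viewer import View
View(sum_distribution.drawPDF()).show()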
I have n data points in some arbitrary space and I cluster them.
The result of my clustering algorithm is a partition represented by an int vector l of length n assigning each point to a cluster. Values of l range from 0 to (possibly) n-1.
Example:
l_1 = [ 1 1 1 0 0 2 6 ]
is a partition of n=7 points into 4 clusters: the first three points are clustered together, the fourth and fifth are together, and the last two points form two distinct singleton clusters.
My question:
Suppose I have two partitions l_1 and l_2; how can I efficiently determine if they represent identical partitions?
Example:
l_2 = [ 2 2 2 9 9 3 1 ]
is identical to l_1 since it represents the same partition of the points (despite the fact that the "numbers"/"labels" of the clusters are not identical).
On the other hand
l_3 = [ 2 2 2 9 9 3 3 ]
is no longer identical since it groups together the last two points.
I'm looking for a solution in either C++, python or Matlab.
Unwanted direction
A naive approach would be to compare the co-occurrence matrix
c1 = bsxfun( @eq, l_1, l_1' );
c2 = bsxfun( @eq, l_2, l_2' );
l_1_l_2_are_identical = all( c1(:)==c2(:) );
The co-occurrence matrix c1 is of size nxn with true if points k and m are in the same cluster and false otherwise (regardless of the cluster "number"/"label").
Therefore if the co-occurrence matrices c1 and c2 are identical then l_1 and l_2 represent identical partitions.
However, since the number of points, n, might be quite large I would like to avoid O(n^2) solutions...
Any ideas?
Thanks!
When are two partitions identical?
Probably only if they have the exact same members.
So if you just want to test for identity, you can do the following:
Substitute each partition ID with the smallest object ID in the partition.
Then two partitionings are identical if and only if this representation is identical.
In your example above, let's assume the vector indexes 1..7 are your object IDs. Then I would get the canonical form
[ 1 1 1 4 4 6 7 ]
  ^ first occurrence at pos 1 (of label 1 in l_1 / label 2 in l_2)
        ^ first occurrence at pos 4
for l_1 and l_2, whereas l_3 canonicalizes to
[ 1 1 1 4 4 6 6 ]
To make it more clear, here is another example:
l_4 = [ A B 0 D 0 B A ]
canonicalizes to
[ 1 2 3 4 3 2 1 ]
since the first occurrence of cluster "A" is at position 1, "B" at position 2, etc.
If you want to measure how similar two clusterings are, a good approach is to look at precision/recall/f1 of the object pairs, where the pair (a,b) exists if and only if a and b belong to the same cluster.
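If you go that route, here is a small sketch of the pair-based computation via a contingency table (my illustration; the function name is made up). It avoids enumerating the O(n^2) pairs explicitly by counting pairs inside each cluster:

from collections import Counter
from math import comb

def pair_precision_recall_f1(labels_true, labels_pred):
    # pairs co-clustered in both labelings, counted via the contingency table
    tp = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    precision = tp / pairs_pred if pairs_pred else 1.0
    recall = tp / pairs_true if pairs_true else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(pair_precision_recall_f1([1, 1, 1, 0, 0, 2, 6], [2, 2, 2, 9, 9, 3, 3]))
# (0.8, 1.0, 0.888...)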
Update: Since it was claimed that this is quadratic, I will further clarify.
To produce the canonical form, use the following approach (actual python code):
def canonical_form(li):
""" Note, this implementation overwrites li """
first = dict()
for i in range(len(li)):
v = first.get(li[i])
if v is None:
first[li[i]] = i
v = i
li[i] = v
return li
print(canonical_form([1, 1, 1, 0, 0, 2, 6]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([2, 2, 2, 9, 9, 3, 1]))
# [0, 0, 0, 3, 3, 5, 6]
print(canonical_form([2, 2, 2, 9, 9, 3, 3]))
# [0, 0, 0, 3, 3, 5, 5]
print(canonical_form(['A', 'B', 0, 'D', 0, 'B', 'A']))
# [0, 1, 2, 3, 2, 1, 0]
print(canonical_form([1, 1, 1, 0, 0, 2, 6]) == canonical_form([2, 2, 2, 9, 9, 3, 1]))
# True
print(canonical_form([1, 1, 1, 0, 0, 2, 6]) == canonical_form([2, 2, 2, 9, 9, 3, 3]))
# False
If you are going to relabel your partitions, as has been previously suggested, you will potentially need to search through n labels for each of the n items; i.e., without a hash map such solutions are O(n^2).
Here is my idea: Scan through both lists simultaneously, maintaining a counter for each partition label in each list.
You will need to be able to map partition labels to counter numbers.
If the counters for each list do not match, then the partitions do not match.
This would be O(n).
Here is a proof of concept in Python:
l_1 = [ 1, 1, 1, 0, 0, 2, 6 ]
l_2 = [ 2, 2, 2, 9, 9, 3, 1 ]
l_3 = [ 2, 2, 2, 9, 9, 3, 3 ]
d1 = dict()
d2 = dict()
c1 = []
c2 = []
# assume lists same length
match = True
for i in range(len(l_1)):
if l_1[i] not in d1:
x1 = len(c1)
d1[l_1[i]] = x1
c1.append(1)
else:
x1 = d1[l_1[i]]
c1[x1] += 1
if l_2[i] not in d2:
x2 = len(c2)
d2[l_2[i]] = x2
c2.append(1)
else:
x2 = d2[l_2[i]]
c2[x2] += 1
if x1 != x2 or c1[x1] != c2[x2]:
match = False
print "match = {}".format(match)
In MATLAB:
function tf = isIdenticalClust( l_1, l_2 )
%
% checks if partitions l_1 and l_2 are identical or not
% (labels are shifted by +1 because accumarray requires positive subscripts;
%  empty bins are filled with 1, i.e. vacuously homogeneous)
%
tf = all( accumarray( {l_1(:)+1} , l_2(:) , [], @(x) double(all( x == x(1) )), 1 ) == 1 ) && ...
     all( accumarray( {l_2(:)+1} , l_1(:) , [], @(x) double(all( x == x(1) )), 1 ) == 1 );
What this does:
it groups the elements of l_2 according to the partition defined by l_1 and checks whether all elements of l_2 within each cluster are identical; then it repeats the same with l_1 grouped according to the partition of l_2.
If both groupings yield homogeneous clusters, the partitions are identical.
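For readers without MATLAB, the same mutual-homogeneity check is easy to sketch in Python (my illustration of the idea, not a line-by-line translation of the accumarray call):

from collections import defaultdict

def mutually_homogeneous(a, b):
    # group the labels of b by the labels of a; every group must be constant
    groups = defaultdict(set)
    for la, lb in zip(a, b):
        groups[la].add(lb)
    return all(len(v) == 1 for v in groups.values())

def is_identical_clust(l_1, l_2):
    return mutually_homogeneous(l_1, l_2) and mutually_homogeneous(l_2, l_1)

print(is_identical_clust([1, 1, 1, 0, 0, 2, 6], [2, 2, 2, 9, 9, 3, 1]))  # True
print(is_identical_clust([1, 1, 1, 0, 0, 2, 6], [2, 2, 2, 9, 9, 3, 3]))  # False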
consider that
0 -- is the first
1 -- is the second
2 -- is the third
.....
9 -- is the 10th
11 -- is the 11th
What is an efficient algorithm to find the nth palindromic number?
I'm assuming that 0110 is not a palindrome, as it is 110.
I could spend a lot of words on describing, but this table should be enough:
#Digits #Pal. Notes
0 1 "0" only
1 9 x with x = 1..9
2 9 xx with x = 1..9
3 90 xyx with xy = 10..99 (in other words: x = 1..9, y = 0..9)
4 90 xyyx with xy = 10..99
5 900 xyzyx with xyz = 100..999
6 900 and so on...
The (nonzero) palindromes with an even number of digits start at p(11) = 11, p(110) = 1001, p(1100) = 100001, .... They are constructed by taking the index n - 10^L, where L = floor(log10(n)), and appending the reversal of this number: p(1101) = 101|101, p(1102) = 102|201, ..., p(1999) = 999|999, etc. This case must be considered for indices n >= 1.1*10^L but n < 2*10^L.
When n >= 2*10^L, we get the palindromes with odd number of digits, which start with p(2) = 1, p(20) = 101, p(200) = 10001 etc., and can be constructed the same way, using again n - 10^L with L=floor(log10(n)), and appending the reversal of that number, now without its last digit: p(21) = 11|1, p(22) = 12|1, ..., p(99) = 89|8, ....
When n < 1.1*10^L, subtract 1 from L to be in the correct setting with n >= 2*10^L for the case of an odd number of digits.
This yields the simple algorithm:
p(n) = { L = logint(n,10);
         P = 10^(L - [1 < n < 1.1*10^L]);  /* avoid exponent -1 for n=1 */
         n -= P;
         RETURN( n * P + reverse( n \ 10^[n >= P] ))
       }
where [...] is 1 if ... is true, 0 else, and \ is integer division.
(The expression n \ 10^[...] is equivalent to: if ... then n\10 else n.)
(I added the condition 1 < n in the exponent to avoid P = 10^(-1) for n=1. If you use integer types, you don't need this. Another choice is to put max(..., 0) as the exponent in P, or to use if n=1 then return(0) right at the start. Also notice that you don't need L after assigning P, so you could use the same variable for both.)
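For reference, here is a direct Python translation of the pseudocode (a sketch; the comparison with 1.1*10^L is done in integer arithmetic to avoid floating-point edge cases, and n = 1 is special-cased as suggested above):

def p(n):
    if n == 1:
        return 0                                  # replaces the "1 < n" guard
    L = len(str(n)) - 1                           # L = floor(log10(n))
    P = 10 ** (L - int(10 * n < 11 * 10 ** L))    # integer test for n < 1.1*10^L
    n -= P
    tail = n // 10 if n >= P else n               # odd digit count: drop the last digit
    return n * P + int(str(tail)[::-1])           # append the reversal

print([p(i) for i in range(1, 13)])   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 22]
print(p(110), p(1101))                # 1001 101101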