Find basic and non basic variables - linear-programming

The objective row of a tableau is 2x+u+5v+z=21, where z is the objective function variable.For each x, u, v, decide if it is a basic or non-basic variables in the tableau.Are all non-basic variables(NBV) of the LP the NBV?

Interpret the coefficients of the objective row in a tableau as the reduced cost of the corresponding column.
We know the following: if a variable is basic, then its corresponding reduced cost is zero. (If a variable is non-basic the reduced cost is likely nonzero, but it can be zero.)
This means: if the reduced cost is non-zero, the variable must be non-basic.
So (x,u,v) are non-basic.
I don't understand the sentence: Are all non-basic variables(NBV) of the LP the NBV? so I cannot help with that.

Related

How to optimize nonlinear funtion with some constraint in c++

I want to find a variable in C++ that allows a given nonlinear formula to have a maximum value in a constraint.
It is to calculate the maximum value of the formula below in the given constraint in C++.
You can also use the library.(e.g. Nlopt)
formula : ln(1+ax+by+c*z)
a, b, c are numbers input by the user
x, y, z are variables to be derived
variable constraint is that x, y, z are positive and x+y+z<=1
This can actually be transformed into a linear optimization problem.
max ln(1+ax+by+cz) <--> max (ax+by+cz) s.t. ax+by+cz > -1
This means that it is a linear optimization problem (with one more constraint) that you can easily handle with whatever C++ methods together with your Convex Optimization knowledge.
Reminders to write a good code:
Check the validity of input value.
Since the input value can be negative, you need to consider this circumstance which can yield different results.
P.S.
This problem seems to be Off Topic on SO.
If it is your homework, it is for your own good to write the code yourself. Besides, we do not have enough time to write that for you.
This should have been a comment if I had more reputation.

stata: inequality constraint in xttobit

Is it possible to constrain parameters in Stata's xttobit to be non-negative? I read a paper where the authors said they did just that, and I am trying to work out how.
I know that you can constrain parameters to be strictly positive by exponentially transforming the variables (e.g. gen x1_e = exp(x1)) and then calling nlcom after estimation (e.g. nlcom exp(_b[x1:_y]) where y is the independent variable. (That may not be exactly right, but I am pretty sure the general idea is correct. Here is a similar question from Statlist re: nlsur).
But what would a non-negative constraint look like? I know that one way to proceed is by transforming the variables, for example squaring them. However, I tried this with the author's data and still found negative estimates from xttobit. Sorry if this is a trivial question, but it has me a little confused.
(Note: this was first posted on CV by mistake. Mea culpa.)
Update: It seems I misunderstand what transformation means. Suppose we want to estimate the following random effects model:
y_{it} = a + b*x_{it} + v_i + e_{it}
where v_i is the individual random effect for i and e_{it} is the idiosyncratic error.
From the first answer, would, say, an exponential transformation to constrain all coefficients to be positive look like:
y_{it} = exp(a) + exp(b)*x_{it} + v_i + e_{it}
?
I think your understanding of constraining parameters by transforming the associated variable is incorrect. You don't transform the variable, but rather you fit your model having reexpressed your model in terms of transformed parameters. For more details, see the FAQ at http://www.stata.com/support/faqs/statistics/regression-with-interval-constraints/, and be prepared to work harder on your problem than you might have expected to, since you will need to replace the use of xttobit with mlexp for the transformed parameterization of the tobit log-likelihood function.
With regard to the difference between non-negative and strictly positive constraints, for continuous parameters all such constraints are effectively non-negative, because (for reasonable parameterization) a strictly positive constraint can be made arbitrarily close to zero.

Enumerate some extreme points near optimum solution

I am looking for a simple way to obtain a lot of "good" solutions in a LP problem (not MIP) with CPLEX, and not only (one of the) optimal basic solution(s). By "good" solutions I mean that the corresponding objective values are not so far from the real optimal value. Such pool of solutions could help the decision-maker...
More precisely, given a certain polyedron Ax<=b with x>=0 and an objective function z=cx I want to maximize, after running the LP, I can obtain the optimal value z*. Then I want to enumerate all the extreme points of the polyhedron given by the set of constraints
Ax <= b
cx >= z* - epsilon
x >= 0
when epsilon is a given tolerance.
I know that CPLEX offers way to generate solution pool (see here), but it will not function because this method is for MIP : it enumerates all the solutions of an IP (or one solution for every given set of fixed integer variables if the problem is a MIP).
An interesting efficient way is to visit the adjacent solutions of the optimal basic solution, i.e. all the adjacent extreme points : if I suppose the polyhedron is not degenerative, for each pair of basic variable x_B and non-basic variable x_N, I compute the basic solution obtained when x_B leaves the basis and x_N enters in the basis. Then I throw the solutions with cx < z*-epsilon, and for the others I repeat the procedure. [I know that I could improve this algorithm, but this is the general idea].
The routine CPPXpivot of the Callable Library could help to do this pivoting operation, but I did not find an equivalent in the C++ API (concert technology). Does someone know if such an equivalent exist, or could propose me an other way to answer my original problem ?
Thanks a lot :) !
RĂ©mi L.
There is one interesting way to make this suitable for use with the Cplex solution pool. Use binary variables to encode the current basis, e.g. basis[k] = 0 meaning nonbasic and basis[k] = 1 indicating variable (or row) k is basic. Of course we have sum(k, basis[k]) = m (number of rows). Finally we have x[k] <= basis[k] * upperbound[k] (i.e. if nonbasic then zero -- assuming positive variables). When we add this to the LP model we end up with a MIP and can enumerate (all or some, optimal or near optimal) bases using the Cplex solution pool. See here and here.

Transform a 1 by 1 matrix intro a variable or scalar

How do transform a 1 by 1 matrix intro a variable or scalar? At the moment I have two matrices which are both 1 by 1, so in principle they are scalars. I would like to divide one of the values (which is the 1 by 1 matrix) by the other value (which is the other 1 by 1 matrix).
I've read that one can do something like that
C[`i',`j']= A[`i',`j']/B[`i',`j']
to do element by element operations in Stata. In this example one would loop over i and j. Unfortunately, it did not work.
In Stata, variables and scalars are two different things. Variables are set up as columns in a Stata database; almost always the subject of some statistical analysis. A scalar is a storage type that holds some expression, be it numeric or string.
The code you show appears to be from this page: http://www.stata.com/support/faqs/data-management/element-by-element-operations-on-matrices/, but you only post one part. That part makes use of local macros, but nowhere do you seem to define them. Furthermore, if you have a matrix with only one element, then you need not loop over the indices of the matrix. Its only element is held in position [1,1].
Below is an example of two matrices with one element each, whose division is saved in a scalar.
clear all
set more off
matrix A = (1)
matrix B = (2)
scalar c = A[1,1]/B[1,1]
display "scalar c is: " c
Stata has its own matrix language, Mata, in case you need "advanced" matrix features.
See at least help macro, help scalar, help matrix, help forvalues and help mata.

Using CRCs as a digest to detect duplicates among files

The primary use of CRCs and similar computations (such as Fletcher and Adler) seems to be for the detection of transmission errors. As such, most studies I have seen seem to address the issue of the probability of detecting small-scale differences between two data sets. My needs are slightly different.
What follows is a very approximate description of the problem. Details are much more complicated than this, but the description below illustrates the functionality I am looking for. This little disclaimer is intended to ward of answers such as "Why are you solving your problem this way when you can more easily solve it this other way I propose?" - I need to solve my problem this way for a myriad of reasons that are not germane to this question or post, so please don't post such answers.
I am dealing with collections of data sets (size ~1MB) on a distributed network. Computations are performed on these data sets, and speed/performance is critical. I want a mechanism to allow me to avoid re-transmitting data sets. That is, I need some way to generate a unique identifier (UID) for each data set of a given size. (Then, I transmit data set size and UID from one machine to another, and the receiving machine only needs to request transmission of the data if it does not already have it locally, based on the UID.)
This is similar to the difference between using CRC to check changes to a file, and using a CRC as a digest to detect duplicates among files. I have not seen any discussions of the latter use.
I am not concerned with issues of tampering, i.e. I do not need cryptographic strength hashing.
I am currently using a simple 32-bit CRC of the serialized data, and that has so far served me well. However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
The other question I have is a bit more subtle. In my current implementation, I ignore the structure of my data set, and effectively just CRC the serialized string representing my data. However, for various reasons, I want to change my CRC methodology as follows. Suppose my top-level data set is a collection of some raw data and a few subordinate data sets. My current scheme essentially concatenates the raw data and all the subordinate data sets and then CRC's the result. However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction. The question is, how does using this methodology affect the probability of collisions?
To put it in a language what will allow me to discuss my thoughts, I'll define a bit of notation. Call my top-level data set T, and suppose it consists of raw data set R and subordinate data sets Si, i=1..n. I can write this as T = (R, S1, S2, ..., Sn). If & represents concatenation of data sets, my original scheme can be thought of as:
UID_1(T) = CRC(R & S1 & S2 & ... & Sn)
and my new scheme can be thought of as
UID_2(T) = CRC(R & CRC(S1) & CRC(S2) & ... & CRC(Sn))
Then my questions are: (1) if T and T' are very different, what CRC algorithm minimizes prob( UID_1(T)=UID_1(T') ), and what CRC algorithm minimizes prob( UID_2(T)=UID_2(T') ), and how do these two probabilities compare?
My (naive and uninformed) thoughts on the matter are this. Suppose the differences between T and T' are in only one subordinate data set, WLOG say S1!=S1'. If it happens that CRC(S1)=CRC(S1'), then clearly we will have UID_2(T)=UID_2(T'). On the other hand, if CRC(S1)!=CRC(S1'), then the difference between R & CRC(S1) & CRC(S2) & ... & CRC(Sn) and R & CRC(S1') & CRC(S2) & ... & CRC(Sn) is a small difference on 4 bytes only, so the ability of UID_2 to detect differences is effectively the same as a CRC's ability to detect transmission errors, i.e. its ability to detect errors in only a few bits that are not widely separated. Since this is what CRC's are designed to do, I would think that UID_2 is pretty safe, so long as the CRC I am using is good at detecting transmission errors. To put it in terms of our notation,
prob( UID_2(T)=UID_2(T') ) = prob(CRC(S1)=CRC(S1')) + (1-prob(CRC(S1)=CRC(S1'))) * probability of CRC not detecting error a few bits.
Let call the probability of CRC not detecting an error of a few bits P, and the probability of it not detecting large differences on a large size data set Q. The above can be written approximately as
prob( UID_2(T)=UID_2(T') ) ~ Q + (1-Q)*P
Now I will change my UID a bit more as follows. For a "fundamental" piece of data, i.e. a data set T=(R) where R is just a double, integer, char, bool, etc., define UID_3(T)=(R). Then for a data set T consisting of a vector of subordinate data sets T = (S1, S2, ..., Sn), define
UID_3(T) = CRC(ID_3(S1) & ID_3(S2) & ... & ID_3(Sn))
Suppose a particular data set T has subordinate data sets nested m-levels deep, then, in some vague sense, I would think that
prob( UID_3(T)=UID_3(T') ) ~ 1 - (1-Q)(1-P)^m
Given these probabilities are small in any case, this can be approximated as
1 - (1-Q)(1-P)^m = Q + (1-Q)*P*m + (1-Q)*P*P*m*(m-1)/2 + ... ~ Q + m*P
So if I know my maximum nesting level m, and I know P and Q for various CRCs, what I want is to pick the CRC that gives me the minimum value for Q + m*P. If, as I suspect might be the case, P~Q, the above simplifies to this. My probability of error for UID_1 is P. My probability of error for UID_3 is (m+1)P, where m is my maximum nesting (recursion) level.
Does all this seem reasonable?
I want a mechanism to allow me to avoid re-transmitting data sets.
rsync has already solved this problem, using generally the approach you outline.
However, I would like to know if anyone can recommend which 32-bit CRC
algorithm (i.e. which polynomial?) is best for minimizing the
probability of collisions in this situation?
You won't see much difference among well-selected CRC polynomials. Speed may be more important to you, in which case you may want to use a hardware CRC, e.g. the crc32 instruction on modern Intel processors. That one uses the CRC-32C (Castagnoli) polynomial. You can make that really fast by using all three arithmetic units on a single core in parallel by computing the CRC on three buffers in the same loop, and then combining them. See below how to combine CRCs.
However, most of the time I already have the CRC's of the subordinate
data sets, and I would rather construct my UID of the top-level data
set by concatenating the raw data with the CRC's of the subordinate
data sets, and then CRC this construction.
Or you could quickly compute the CRC of the entire set as if you had done a CRC on the whole thing, but using the already calculated CRCs of the pieces. Look at crc32_combine() in zlib. That would be better than taking the CRC of a bunch of CRCs. By combining, you retain all the mathematical goodness of the CRC algorithm.
Mark Adler's answer was bang on. If I'd taken my programmers hat off and put on my mathematicians hat, some of it should have been obvious. He didn't have the time to explain the mathematics, so I will here for those who are interested.
The process of calculating a CRC is essentially the process of doing a polynomial division. The polynomials have coefficients mod 2, i.e. the coefficient of each term is either 0 or 1, hence a polynomial of degree N can be represented by an N-bit number, each bit being the coefficient of a term (and the process of doing a polynomial division amounts to doing a whole bunch of XOR and shift operations). When CRC'ing a data block, we view the "data" as one big polynomial, i.e. a long string of bits, each bit representing the coefficient of a term in the polynomial. Well call our data-block polynomial A. For each CRC "version", there has been chosen the polynomial for the CRC, which we'll call P. For 32-bit CRCs, P is a polynomial with degree 32, so it has 33 terms and 33 coefficients. Because the top coefficient is always 1, it is implicit and we can represent the 32nd-degree polynomial with a 32-bit integer. (Computationally, this is quite convenient actually.) The process of calculating the CRC for a data block A is the process of finding the remainder when A is divided by P. That is, A can always be written
A = Q * P + R
where R is a polynomial of degree less than degree of P, i.e. R has degree 31 or less, so it can be represented by a 32-bit integer. R is essentially the CRC. (Small note: typically one prepends 0xFFFFFFFF to A, but that is unimportant here.) Now, if we concatenate two data blocks A and B, the "polynomial" corresponding to the concatenation of the two blocks is the polynomial for A, "shifted to the left" by the number of bits in B, plus B. Put another way, the polynomial for A&B is A*S+B, where S is the polynomial corresponding to a 1 followed by N zeros, where N is the number of bits in B. (i.e. S = x**N ). Then, what can we say about the CRC for A&B? Suppose we know A=Q*P+R and B=Q'*P+R', i.e. R is the CRC for A and R' is the CRC for B. Suppose we also know S=q*P+r. Then
A * S + B = (Q*P+R)*(q*P+r) + (Q'*P+R')
= Q*(q*P+r)*P + R*q*P + R*r + Q'*P + R'
= (Q*S + R*q + Q') * P + R*r + R'
So to find the remainder when A*S+B is divided by P, we need only find the remainder when R*r+R' is divided by P. Thus, to calculate the CRC of the concatenation of two data streams A and B, we need only know the separate CRC's of the data streams, i.e. R and R', and the length N of the trailing data stream B (so we can compute r). This is also the content of one of Marks other comments: if the lengths of the trailing data streams B are constrained to a few values, we can pre-compute r for each of these lengths, making combination of two CRC's quite trivial. (For an arbitrary length N, computing r is not trivial, but it is much faster (log_2 N) than re-doing the division over the entire B.)
Note: the above is not a precise exposition of CRC. There is some shifting that goes on. To be precise, if L is the polynomial represented by 0xFFFFFFFF, i.e. L=x*31+x*30+...+x+1, and S_n is the "shift left by n bits" polynomial, i.e. S_n = x**n, then the CRC of a data block with polynomial A of N bits, is the remainder when ( L * S_N + A ) * S_32 is divided by P, i.e. when (L&A)*S_32 is divided by P, where & is the "concatenation" operator.
Also, I think I disagree with one of Marks comments, but he can correct me if I'm wrong. If we already know R and R', comparing the time to compute the CRC of A&B using the above methodology, as compared with computing it the straightforward way, does not depend on the ratio of len(A) to len(B) - to compute it the "straight forward" way, one really does not have to re-compute the CRC on the entire concatenated data set. Using our notation above, one only needs to compute the CRC of R*S+B. That is, instead of pre-pending 0xFFFFFFFF to B and computing its CRC, we prepend R to B and compute its CRC. So its a comparison of the time to compute B's CRC over again with the time to compute r, (followed by dividing R*r+R' by P, which is trivial and inconsequential in time likely).
Mark Adler's answer addresses the technical question so that's not what I'll do here. Here I'm going to point out a major potential flaw in the synchronization algorithm proposed in the OP's question and suggest a small improvement.
Checksums and hashes provide a single signature value for some data. However, being of finite length, the number of possible unique values of a checksum/hash is always smaller than the possible combinations of the raw data if the data is longer. For instance, a 4 byte CRC can only ever take on 4 294 967 296 unique values whilst even a 5 byte value which might be the data can take on 8 times as many values. This means for any data longer than the checksum itself, there always exists one or more byte combinations with exactly the same signature.
When used to check integrity, the assumption is that the likelihood of a slightly different stream of data resulting in the same signature is small so that we can assume the data is the same if the signature is the same. It is important to note that we start with some data d and verify that given a checksum, c, calculated using a checksum function, f that f(d) == c.
In the OP's algorithm, however, the different use introduces a subtle, detrimental degradation of confidence. In the OP's algorithm, server A would start with the raw data [d1A,d2A,d3A,d4A] and generate a set of checksums [c1,c2,c3,c4] (where dnA is the n-th data item on server A). Server B would then receive this list of checksums and check its own list of checksums to determine if any are missing. Say Server B has the list [c1,c2,c3,c5]. What should then happen is that it requests d4 from Server A and the synchronization has worked properly in the ideal case.
If we recall the possibilty of collisions, and that it doesn't always take that much data to produce one (e.g. CRC("plumless") == CRC("buckeroo")), then we'll quickly realize that the best guarantee our scheme provides is that server B definitely doesn't have d4A but it cannot guarantee that it has [d1A,d2A,d3A]. This is because it is possible that f(d1A) = c1 and f(d1B) = c1 even though d1A and d1B are distinct and we would like both servers to have both. In this scheme, neither server can ever know about the existence of both d1A and d1B. We can use more and more collision resistant checksums and hashes but this scheme can never guarantee complete synchronization. This becomes more important, the greater the number of files the network must keep track of. I would recommend using a cryptographic hash like SHA1 for which no collisions have been found.
A possible mitigation of the risk of this is to introduce redundant hashes. One way of doing is is to use a completely different algorithm since whilst it is possible crc32(d1) == crc32(d2) it is less likely that adler32(d1) == adler32(d2) simultaneously. This paper suggests you don't gain all that much this way though. To use the OP notation, it is also less likely that crc32('a' & d1) == crc32('a' & d2) and crc32('b' & d1) == crc32('b' & d2) are simultaneously true so you can "salt" to less collision prone combinations. However, I think you may just as well just use a collision resistant hash function like SHA512 which in practice likely won't have that great an impact on your performance.