So, this should be an easy one, but I've always been garbage at contrasts, and the SAS literature isn't really helping. We are running an analysis, and we need to compare different combinations of variables. For example, we have 8 different breeds and 3 treatments, and want to contrast breed 5 against breed 7 at treatment 1. The code I have written is:
proc mixed data=data;
  class breed treatment field;
  model ear_mass = field breed field*breed treatment
                   field*treatment breed*treatment;
  random field*breed*treatment;
  estimate "1 C0"
    breed 0 0 0 0 1 0 -1 0
    breed*treatment 0 0 0 0 1 0 0 0 -1 0 0;
run;
What exactly am I doing wrong in my estimate line that isn't working out?
Your ESTIMATE statement for this particular comparison must also include coefficients for the field*breed interaction.
When defining contrasts, I recommend starting small and building up. Write a contrast for breed 5 at treatment 1 (B5T1), and check its value against its lsmean to confirm that you've got the right coefficients. Note that you have to average over all field levels to get this estimate. Likewise, write a contrast for B7T1. Then subtract the coefficients for B5T1 from those for B7T1, noting that the coefficients for some terms (e.g., treatment*field) are now all zero.
An easier alternative is the LSMESTIMATE statement, which allows you to build contrasts from the lsmeans rather than the model parameters. See the documentation and Kiernan et al. (2011), "CONTRAST and ESTIMATE Statements Made Easy: The LSMESTIMATE Statement".
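For example, here is a minimal sketch of the breed 5 vs. breed 7 comparison at treatment 1 using LSMESTIMATE (available in PROC MIXED as of SAS 9.22). The bracket syntax gives a coefficient followed by the level indices of each factor in the effect; the level ordering assumed here should be confirmed against your LSMEANS table:

proc mixed data=data;
  class breed treatment field;
  model ear_mass = field breed field*breed treatment
                   field*treatment breed*treatment;
  random field*breed*treatment;
  lsmeans breed*treatment;
  /* +1 * lsmean(breed 5, treatment 1), -1 * lsmean(breed 7, treatment 1) */
  lsmestimate breed*treatment "B5 vs B7 at T1"
              [1, 5 1] [-1, 7 1] / e;  /* e prints the coefficients for checking */
run;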
Alas, you must tell SAS; it can't tell you.
You are right, it is easy to make an error. It is important to know the ordering of factor levels in the interaction, which is determined by the order of the factors in the CLASS statement. You can confirm the ordering by looking at the order of the interaction lsmeans in the LSMEANS table.
To check, you can compute the estimate of the contrast by hand using the lsmeans. If it matches, then you can be confident that the standard error, and hence the inferential test, are also correct.
The LSMESTIMATE statement is a really useful tool, faster and much less prone to error than defining contrasts in terms of the model parameters.
I am meant to find the variable that accounts for the highest percentage of variance in the dependent variable SalePrice. I used the following code to find that the natural log of SalePrice had the highest r-square value, at over 90%:
proc reg data = proj.team4;
model SalePrice = Log_Price;
title "Regression of SalePrice and Log_Price";
run;
The variable that came in second was Overall_Qual, with an r-square value of about 50%.
I must write a prediction equation that will predict the value of SalePrice given the best significant predictor. Even though the Log_Price r-square is much higher, am I right to think that the Overall_Qual variable should be used in the prediction equation instead?
My reasoning is that an equation of the form
SalePrice = m(ln(SalePrice)) + b
could never be used for prediction, since the value of SalePrice is needed to evaluate the right-hand side.
Though it makes some mathematical sense to approximate a function over a domain by a linear function, regressing SalePrice on its own logarithm is statistical nonsense, because it tells you nothing about what determines SalePrice.
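A minimal sketch of the sensible alternative, using the best genuine predictor from your screening (data set and variable names as given in the question):

proc reg data = proj.team4;
   model SalePrice = Overall_Qual;
   title "Prediction equation for SalePrice from Overall_Qual";
run;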
Reduced costs give the dual variables corresponding to the box constraints on the variables, as was pointed out in the answer to this related question: LP: postive reduced costs corresponding to positive variables?
How do I know whether the lower or the upper bound is the active constraint? Of course, I could check whether the difference between the variable's value and each of its bounds is smaller than some epsilon. However, the choice of such an epsilon is unclear to me, because a model could fix a variable by setting the lower bound equal to the upper bound; in that case, no epsilon could unambiguously indicate which of the bounds is the active one.
Does CPLEX provide the information about which bound is active in its C++ API? Does any other LP solver do so? Is there another trick to figure out the active bound?
Thank you.
To a large extent you can look at the sign. The rule for reduced costs is:
                 Basic   Nonbasic at LB   Nonbasic at UB
minimization       0          >= 0             <= 0
maximization       0          <= 0             >= 0
Some problems with this:
Not all solvers may follow this convention (especially when maximizing).
Degeneracy may make things difficult.
Most solvers will give you access to the basis status. E.g., CPLEX has BasisStatus, which gives you exactly the basis status of a variable: Basic, AtLower, AtUpper, or FreeOrSuperbasic.
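A minimal sketch with the Concert C++ API, assuming a solved IloCplex instance and named variables (check the reference manual of your CPLEX version for getBasisStatus):

#include <ilcplex/ilocplex.h>

// Report for each variable whether it is basic or nonbasic at one of its
// bounds, instead of comparing variable values against an epsilon.
void reportActiveBounds(IloCplex& cplex, const IloNumVarArray& vars) {
    for (IloInt i = 0; i < vars.getSize(); ++i) {
        switch (cplex.getBasisStatus(vars[i])) {
            case IloCplex::Basic:
                cplex.getEnv().out() << vars[i].getName() << ": basic\n"; break;
            case IloCplex::AtLower:
                cplex.getEnv().out() << vars[i].getName() << ": at lower bound\n"; break;
            case IloCplex::AtUpper:
                cplex.getEnv().out() << vars[i].getName() << ": at upper bound\n"; break;
            default:
                cplex.getEnv().out() << vars[i].getName() << ": free/superbasic\n";
        }
    }
}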
Consider two variables sex and age, where sex is treated as a categorical variable and age is treated as a continuous variable. I want to specify the full factorial using the non-interacted age term as the baseline (the default for sex##c.age is to omit one of the sex categories).
The best I have come up with so far is to write out the factorial manually (omitting `age' from the regression):
reg y sex#c.age i.sex
This is mathematically equivalent to
reg y sex##c.age
but allows me to directly read off the regression coefficients (and standard errors!) on the interaction term sex*age for both sexes.
Is there a way to stick to the more economical "##" notation but make the continuous variable the omitted category?
(I know the manual approach has little notational overhead in the example given here, but the overhead becomes huge when working with triple-interaction terms.)
Not a complete answer, but a workaround: one can use `lincom' to obtain linear combinations of coefficients and standard errors. E.g., one could obtain the combined age effect for each sex as follows:
reg y sex##c.age
* combined slope of age for each sex: base slope plus interaction term
* (for the base category, the interaction coefficient is simply zero)
forval i = 1/2 {
    lincom age + `i'.sex#c.age
}
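The same pattern extends to the triple-interaction case mentioned in the question. A hypothetical sketch with a second categorical factor, here called group (not in the original question), where the combined slope sums every interaction term involving age:

* hypothetical second factor `group': sum all interaction terms involving age
reg y sex##group##c.age
forval i = 1/2 {
    forval j = 1/2 {
        lincom age + `i'.sex#c.age + `j'.group#c.age + `i'.sex#`j'.group#c.age
    }
}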
I am new to CPLEX. I'm solving an integer programming problem, but I have a problem with the objective function.
The problem is that I have a project with a due date D, and if the project is tardy, I incur a tardiness penalty b, so the cost looks like b*(cn - D), where cn is the real completion time of the project, and it is a decision variable.
It must look like this:
if (cn - D) >= 0, the penalty is b*(cn - D); otherwise it is 0.
I tried to use an "if-then" constraint, but it seems that it does not work with a decision variable.
I looked at questions similar to this one but could not find a solution. Please help me define a correct objective function.
The standard way to model this is:
min sum(i, penalty(i)*Tardy(i))
Tardy(i) >= CompletionTime(i) - DueDate(i)
Tardy(i) >= 0
Tardy is a non-negative variable, so it can never become negative; with a positive penalty, it will equal max(0, CompletionTime(i) - DueDate(i)) at the optimum. The other quantities are:
penalty: a constant indicating the cost of job i being tardy one unit of time.
CompletionTime: a variable that holds the completion time of job i.
DueDate: a constant with the due date of job i.
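A minimal Concert C++ sketch of this linearization for a single project, matching the question's b, cn, and D (the numeric values and the stand-in constraint on cn are assumptions for illustration):

#include <ilcplex/ilocplex.h>

int main() {
    IloEnv env;
    IloModel model(env);

    const IloNum D = 10.0;  // due date (toy value)
    const IloNum b = 5.0;   // tardiness penalty per unit of time (toy value)

    IloNumVar cn(env, 0, IloInfinity, ILOFLOAT, "completion");
    IloNumVar tardy(env, 0, IloInfinity, ILOFLOAT, "tardy"); // lower bound 0 gives Tardy >= 0

    model.add(tardy >= cn - D);  // Tardy >= CompletionTime - DueDate
    model.add(cn >= 12.0);       // stand-in for the real scheduling constraints
    model.add(IloMinimize(env, b * tardy));

    IloCplex cplex(model);
    cplex.solve();               // here: cn = 12, tardy = 2, objective = 10
    cplex.out() << "tardiness cost = " << cplex.getObjValue() << std::endl;

    env.end();
    return 0;
}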
The formulation above measures the total tardiness. Sometimes we also want to measure the count: the number of jobs that are tardy. This is to prevent many jobs from being tardy. In the most general case one would have both the sum and the count in the objective, with different weights or penalties.
There is a virtually unlimited number of papers showing MIP formulations on scheduling models that involve tardiness. Instead of reinventing the wheel, it may be useful to consult some of them and see what others have done to formulate this.
I have a PROC DISCRIM statement that runs a KNN analysis. When I set k = 1, it assigns everything a category (as expected). But when k > 1, it leaves some observations unassigned (it sets the category to Other).
I'm assuming this is the result of deadlocked votes between two or more of the categories. I know there are ways around this: either take a random one of the deadlocked categories as the answer, or take the nearest of the deadlocked votes as the answer.
Is this functionality available in proc discrim? How do you tell it how to deal with deadlocks?
Cheers!
Your assumption is correct: when the number of nearest neighbors is two or more, an observation is assigned to the "Other" class when two or more of the designated classes tie for the highest probability of assignment.
You can see this by specifying the PROC DISCRIM statement option OUT=SASdsn to write an output data set showing how well the procedure classified the input observations. This data set contains the probability of assignment to each of the designated classes. For example, using two nearest neighbors (K=2) with the iris data set yields five observations that the procedure classifies as ambiguous, with a probability of 0.50 of being assigned to either the Versicolor or the Virginica class.
From the output data set, you can select these ambiguously classified observations and assign them randomly to the tied classes in a subsequent DATA step. Or you can compare the values of the classification variables for each ambiguous observation to the class means, perhaps by calculating a squared distance standardized by the standard deviation of each variable, and assign the observation to the "closest" class.
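A hedged sketch of the random-assignment approach, using the usual PROC DISCRIM output conventions (the _INTO_ variable holds the assigned class and the posterior-probability variables are named after the class levels); adapt the class names and tie test to your own data:

proc discrim data=sashelp.iris method=npar k=2 out=scored;
   class Species;
run;

/* Break 0.50/0.50 ties between Versicolor and Virginica at random */
data resolved;
   set scored;
   if Versicolor = 0.5 and Virginica = 0.5 then
      _INTO_ = ifc(rand("uniform") < 0.5, "Versicolor", "Virginica");
run;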