When writing a Prediction Equation, can the natural log of the dependent variable be used as a significant predictor? - sas

I am meant to find the variable that accounts for the highest percentage of variance in the dependent variable SalePrice. Using the following code, I found that the natural log of SalePrice (Log_Price) had the highest R-square value, over 90%:
proc reg data = proj.team4;
model SalePrice = Log_Price;
title "Regression of SalePrice and Log_Price";
run;
The variable that came in second was Overall_Qual, with an R-square value of about 50%.
I must write a prediction equation that predicts the value of SalePrice from the best significant predictor. Even though the Log_Price R-square is much higher, am I right to think that the Overall_Qual variable should be used in the prediction equation instead?
My reasoning is that an equation of the form:
SalePrice = m(ln(SalePrice)) + b
could never be used to predict SalePrice, since the value of SalePrice is needed in order to evaluate the right-hand side.

Though it makes some mathematical sense to try to fit any function over a domain by a linear function, this is statistical nonsense, because regressing SalePrice on its own logarithm tells you nothing about what determines SalePrice.
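For comparison, a minimal sketch of the alternative model, reusing the data set and variable names from the question (so this is only illustrative):

proc reg data = proj.team4;
   /* Regress SalePrice on Overall_Qual, the strongest predictor that is
      not itself a transformation of SalePrice */
   model SalePrice = Overall_Qual;
   title "Regression of SalePrice on Overall_Qual";
run;

The parameter estimates from this output give a prediction equation of the form SalePrice = b0 + b1*Overall_Qual, which can be evaluated without already knowing SalePrice.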

Related

Residual values such as 4e-14 randomly appearing in Cplex LP model solution: How can I get rid of them?

While the results of my linear programming model in CPLEX seem to make sense, the q variable sometimes shows tiny values such as 4e-14, seemingly at random. This has no effect on the decision variables, but it is still very irritating, as I am not sure whether something in my model is incorrect. You can see the results of the q variable with the tiny residuals here: Results q variable. These residuals only started appearing in my model once I introduced binary variables.
q is defined as: dexpr float q [t in Years, i in Options] = (c[i] * (a[t+s[i]][i]-a[t+s[i]-1][i]));
a is a decision variable
This is a constraint q is subject to: q[i][t] == a[i] * p[i] * y[t][i]
Since y is a binary variable, q should either take the value of a[i] * p[i] or be 0. This is why I am so irritated by the residual values.
Does anybody have any idea why these values appear and how to get rid of them? I have already spent a lot of time on this problem and have no idea how to solve it.
Things I noticed while trying to solve it:
Turning all the input variables into integer variables doesn't change anything.
Turning q into an integer variable solves the problem, but ruins the model, since a[i][t] needs to be a float variable.
Adding a constraint making q >= 0 does not eliminate the negative residual values such as -4e-14.
Adding a constraint making q = 0 for a specific t eliminates the residual values there, but of course also ruins the model.
Thank you very much for your help!
This is a tolerance issue. MIP solvers such as CPLEX have a number of tolerances; the ones in play here are the integer feasibility tolerance (epint) and the feasibility tolerance (eprhs). You can tighten them, but I usually leave them as they are. Sometimes it helps to round results before printing them, or simply to use fewer digits when formatting the output.
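For illustration, a sketch in OPL of both options, assuming the sets Years and Options and the expression q from the question (epint and eprhs are the standard CPLEX tolerance parameters, with defaults 1e-5 and 1e-6):

// Preprocessing block (place it before the objective and constraints)
// that tightens the tolerances before the solve.
execute SET_TOLERANCES {
  cplex.epint = 1e-9;  // integer feasibility tolerance
  cplex.eprhs = 1e-9;  // feasibility tolerance
}

// Postprocessing block (place it after the constraints) that rounds
// away the numerical noise when displaying q.
execute SHOW_Q {
  for (var t in Years)
    for (var i in Options)
      writeln("q[", t, "][", i, "] = ", Math.round(q[t][i] * 1e6) / 1e6);
}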

Make continuous variable the baseline when interacting with factor variable (stata)

Consider two variables sex and age, where sex is treated as a categorical variable and age as a continuous variable. I want to specify the full factorial using the non-interacted age term as the baseline (the default for sex##c.age is to omit one of the sex categories).
The best I could come up with so far is to write out the factorial manually (omitting `age' from the regression):
reg y sex#c.age i.sex
This is mathematically equivalent to
reg y sex##c.age
but allows me to directly infer the regression coefficients (and standard errors!) on the interaction term sex*age for both genders.
Is there a way to stick to the more economical "##" notation but make the continuous variable the omitted category?
(I know the manual approach adds little notational overhead in the example given here, but the overhead becomes huge when working with triple-interaction terms.)
Not a complete answer, but a workaround: One can use `lincom' to obtain linear combinations of coefficients and standard errors. E.g., one could obtain the combined effects for both sexes as follows:
reg y sex##c.age
forval i = 1/2 {
lincom age + `i'.sex#c.age
}
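The same workaround extends to the triple-interaction case mentioned in the question. A sketch, assuming a second factor variable named group that is also coded 1/2 (both the variable and its coding are hypothetical):

reg y sex##group##c.age
forval i = 1/2 {
    forval j = 1/2 {
        lincom age + `i'.sex#c.age + `j'.group#c.age + `i'.sex#`j'.group#c.age
    }
}

Each lincom call reports the age slope (and its standard error) for one sex-by-group cell; base-level coefficients are simply treated as zero.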

stata: inequality constraint in xttobit

Is it possible to constrain parameters in Stata's xttobit to be non-negative? I read a paper where the authors said they did just that, and I am trying to work out how.
I know that you can constrain parameters to be strictly positive by exponentially transforming the variables (e.g. gen x1_e = exp(x1)) and then calling nlcom after estimation (e.g. nlcom exp(_b[x1:_y]), where y is the independent variable). (That may not be exactly right, but I am pretty sure the general idea is correct. Here is a similar question from Statalist re: nlsur.)
But what would a non-negative constraint look like? I know that one way to proceed is by transforming the variables, for example squaring them. However, I tried this with the authors' data and still found negative estimates from xttobit. Sorry if this is a trivial question, but it has me a little confused.
(Note: this was first posted on CV by mistake. Mea culpa.)
Update: It seems I misunderstand what transformation means. Suppose we want to estimate the following random effects model:
y_{it} = a + b*x_{it} + v_i + e_{it}
where v_i is the individual random effect for i and e_{it} is the idiosyncratic error.
From the first answer, would, say, an exponential transformation to constrain all coefficients to be positive look like:
y_{it} = exp(a) + exp(b)*x_{it} + v_i + e_{it}
?
I think your understanding of constraining parameters by transforming the associated variable is incorrect. You don't transform the variable; rather, you fit your model after reexpressing it in terms of transformed parameters. For more details, see the FAQ at http://www.stata.com/support/faqs/statistics/regression-with-interval-constraints/, and be prepared to work harder on your problem than you might have expected, since you will need to replace the use of xttobit with mlexp for the transformed parameterization of the tobit log-likelihood function.
With regard to the difference between non-negative and strictly positive constraints: for continuous parameters, a strictly positive constraint is effectively a non-negative one, because (for a reasonable parameterization) the parameter can be made arbitrarily close to zero.
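To make the reparameterization idea concrete, here is a minimal sketch with made-up variables y and x, for a plain normal linear model rather than the tobit likelihood itself: the slope on x is written as exp() of a free parameter, so it can never be negative.

* Sketch only: the slope on x is exp({lnb}), which is positive by construction
* but can get arbitrarily close to zero.
mlexp (lnnormalden(y, {a} + exp({lnb})*x, exp({lnsigma})))
nlcom exp(_b[/lnb])   // recover the constrained slope with a standard error

For xttobit itself, the same trick would have to be applied inside the random-effects tobit log likelihood written out for mlexp, as described above.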

How to interpret the constant in an AREG output?

I am running regressions with fixed effects in Stata, using areg, and I just realized it reports a constant term. Allegedly, what areg does is perform a "within" transformation of the data and then run OLS on the transformed data. However, the constant term from the original model is destroyed by the "within" transformation.
Therefore, what does the constant reported by areg mean? Is it a programming mistake? I don't think so, because areg does not allow the -nocons- option, and it would seem that the reason is connected to the meaning of the constant.
With areg (as well as xtreg with the fe option), the intercept is fit so that y-bar minus x-bar times beta-hat is equal to zero. In other words, Stata calculates an intercept that makes the prediction calculated at the means of the independent variables equal to the mean of the dependent variable.
As said in the Stata manual:
The intercept reported by areg deserves some explanation because, given k mutually exclusive and exhaustive dummies, it is arbitrary. areg identifies the model by choosing the intercept that makes the prediction calculated at the means of the independent variables equal to the mean of the dependent variable: y = xb.
areg does give a different constant than regress does.
areg is somewhat tricky, as it is based on the assumption that the (absorbed) number of groups does not increase with sample size.
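A quick way to see this property in action (purely illustrative, using Stata's auto dataset):

sysuse auto, clear
areg price weight length, absorb(rep78)
predict yhat if e(sample), xb       // linear prediction including the reported constant
summarize price yhat if e(sample)   // the two means coincide

The mean of the linear prediction equals the mean of the dependent variable over the estimation sample, which is exactly how the reported intercept is pinned down.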

How do you handle deadlocks using Proc Discrim in SAS for KNN?

I have a PROC DISCRIM step that runs a KNN analysis. When I set k = 1, it assigns every observation a category (as expected). But when k > 1, it leaves some observations unassigned (the category is set to Other).
I'm assuming this is the result of deadlocked (tied) votes between two or more of the categories. I know there are ways around this: either take one of the deadlocked categories at random as the answer, or take the nearest of the deadlocked categories as the answer.
Is this functionality available in PROC DISCRIM? How do you tell it how to deal with deadlocks?
Cheers!
Your assumption is correct: when the number of nearest neighbors is two or more, an observation is assigned to the "Other" class when two or more of the designated classes have the same (tied) probability of assignment. You can see this by specifying the OUT=SASdsn option on the PROC DISCRIM statement to write a SAS output data set recording how the procedure classified the input observations; this data set contains the probability of assignment to each of the designated classes. For example, using two nearest neighbors (K=2) with the iris data set yields five observations that the procedure classifies as ambiguous, with a probability of 0.50 of being assigned to either the Versicolor or the Virginica class. From the output data set, you can select these ambiguously classified observations and assign them randomly to one of the tied classes in a subsequent DATA step. Or, you can compare the values of the variables used to classify these observations against the class means of those variables, perhaps by calculating a squared distance (optionally standardized by the standard deviation of each variable) and assigning each observation to the "closest" class.
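A sketch of that post-processing, using SASHELP.IRIS purely for illustration (the data set name, the random seed, and the assumption that tied observations are labeled Other in the OUT= data set, as described above, are not taken from the question; posterior probabilities appear in variables named after the class levels, and _INTO_ holds the assigned class):

proc discrim data=sashelp.iris method=npar k=2 out=knnout;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

data resolved;
   set knnout;
   /* Break the 0.50/0.50 Versicolor/Virginica ties at random */
   if upcase(_INTO_) = 'OTHER' then do;
      if ranuni(12345) < 0.5 then _INTO_ = 'Versicolor';
      else _INTO_ = 'Virginica';
   end;
run;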