SAS stdi output clarification

I'm attempting to convert some old SAS macros into Python, and am a bit unclear about some of the terminology used in SAS. In the macro, the PROC REG step is:
proc reg data=model_file;
model &y = &x;
output out=&outfile r=resid stdi=resid_error;
run;
I understand that r means the individual residual per data point, but I was unclear what stdi meant. According to the SAS manual, stdi means "standard error of the individual predicted value", so there is one stdi for each row in the dataset. I searched around a bit and found a lecture slide from the University of Wisconsin which I believe explains how to calculate stdi.
However, two (EDIT: ONE) questions remain:

1. Is the method for calculating the standard error of an individual prediction in the lecture slide indeed correct? I've never seen this method before, so I still have my doubts. I've looked at the SAS manual, but its definition of STDI is a bit confusing. Specifically, h(i) is defined in terms of [X'X] followed by a bar, and I don't know what that bar is supposed to mean.

2. The way the standard error of individual predictions is calculated here uses x. What happens if you have run a regression with multiple X columns? Does stdi assume only a single X column?
Answer: no. You can have multiple X columns and still get a STDI value for each row.

I'm not a statistician, and your question could have included a lot more detail, but a quick Google search suggests that you're looking at PROC REG. The main documentation for PROC REG is here:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_reg_sect015.htm
and there is a dedicated page for "Model Fit and Diagnostic Statistics" including the relevant formulae here:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_reg_sect039.htm
Maybe that will answer your question. Although these things don't interest me directly, I believe SAS's documentation is generally good at describing the exact computations each procedure performs.
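Since the goal is a Python port: the formulas on that page reduce to STDI_i = s * sqrt(1 + h_i), where h_i = x_i (X'X)^(-1) x_i' is the leverage of row i (the "bar" after [X'X] in the question is almost certainly a poorly rendered inverse superscript, i.e. (X'X)^(-1)). Leverage is defined for any number of X columns, which is why STDI works with multiple predictors. A minimal numpy sketch on made-up data (the toy X and y are invented for illustration):

```python
import numpy as np

# Toy stand-in for model_file: an intercept plus two predictors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=20)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS coefficients
resid = y - X @ beta                            # r= in the OUTPUT statement
n, p = X.shape
mse = resid @ resid / (n - p)                   # s^2, the mean squared error
# Leverage h_i = x_i (X'X)^{-1} x_i', one value per observation.
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
stdi = np.sqrt(mse * (1.0 + h))                 # stdi= in the OUTPUT statement
```

A useful sanity check: the leverages h always sum to the number of fitted parameters, and every stdi is at least the residual standard error s.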

Related

What is the difference between the DATALINES and CARDS statements?

I am seeking a job involving SAS. An interviewer asked me this question. I said there is no difference, that both are used to read internal data, but he told me there is a difference between the CARDS and DATALINES statements.
Kindly help me crack this question.
There is no significant difference between the two. CARDS is defined as an alias of DATALINES, and by definition an alias should have identical behavior to the statement it is an alias of.
There is an insignificant difference, which is really just undefined behavior. It is explained on page 4 of Andrew Kuligowski's 2007 paper, and relates to how CARDS and DATALINES give slightly different results when used in a FILE statement (which they are not intended to be used with). The behavior differs in modern SAS (9.4): DATALINES now gives an error, but a different one from CARDS. Still, since it is undefined behavior, it would be an absurd thing for an interviewer to expect as an answer.
Your interviewer may have been referring to DATALINES versus DATALINES4, or CARDS versus CARDS4. Those are different, because the latter forms are terminated by four semicolons (;;;;) instead of one, to allow for the possibility of a semicolon in the data itself.

Exclude variables in regression based on causality-criteria in SAS

I have done my best to search the web for an answer to my question, but haven't been able to find any. Maybe I'm not asking in the right way, or maybe my problem can't be solved... Well, here goes nothing!
When running a regression in SAS, it is possible to do backward or forward selection and thereby eliminate all insignificant variables, which is great. But just because the p-value of a variable is ≤ 0.05, that doesn't necessarily mean the result is correct.
E.g., I run a regression in SAS with the dependent variable being the number of deaths due to illness and the independent variable being the number of doctors. The result is significant with p ≤ 0.05, but the coefficient says that as the number of doctors rises, the number of deaths also goes up. This would probably be a spurious regression with the causality backwards, but SAS is only a computer and doesn't know which way the causality should go. (Of course it could also be true that more doctors = more deaths due to some other factor, but let us ignore that for now.)
My question is: Is it possible to run a regression and then tell SAS that it must do backward/forward elimination, but according to some rules I set, so that it also excludes variables that don't meet those rules? E.g., if deaths go up as the number of doctors increases, exclude the variable number of doctors? And what would that look like?
I really hope, that someone can help me, because I am running a regression for many different years with more than 50 variables, and it would be great if I didn't have to go through all results myself.
Thanks :)
I don't think this is possible or recommended. As mentioned, SAS is a computer and can't know which regression results are spurious. What if more doctors = more medical procedures = more death? Obviously you need to apply expert opinion to each situation but the above scenario is just as plausible.
You also mention 'share of docs', which isn't the actual number of doctors if I'm correct? So it may also be an artifact of how this metric is calculated.
If you have a specific set of rules you want to apply, that may be possible. But you first have to define all those rules and have a level of certainty regarding them.
If you need to specify unusual parameter selection criteria you could always roll your own machine learning by brute force: partition the data, run different regression models over all the partitions in macro loops, and use something like AIC to select the best model.
However, unless you are a machine learning expert it's probably best to start with something like proc glmselect.
SAS can do both forward selection and backwards elimination in the glmselect procedure, e.g.:
proc glmselect data=...;
model ... / selection=forward;
...
It would also be possible to combine both approaches - i.e. run several iterations of proc glmselect in macro loops, each with different model specifications, and then choose the best result.
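The brute-force idea above (fit many candidate models, keep the best by an information criterion) is language-agnostic. Here is a hedged Python sketch of greedy forward selection by AIC; the variable names and toy data are invented, and this only loosely mimics what SELECTION=FORWARD does in GLMSELECT:

```python
import numpy as np

def ols_aic(X, y):
    """AIC of an OLS fit (up to an additive constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def forward_select(candidates, y, n):
    """Greedy forward selection: repeatedly add the variable that lowers AIC most."""
    chosen, pool = [], dict(candidates)
    X = np.ones((n, 1))                       # start from the intercept-only model
    best = ols_aic(X, y)
    improved = True
    while improved and pool:
        improved = False
        scores = {name: ols_aic(np.column_stack([X, col]), y)
                  for name, col in pool.items()}
        name = min(scores, key=scores.get)
        if scores[name] < best:
            best, improved = scores[name], True
            X = np.column_stack([X, pool.pop(name)])
            chosen.append(name)
    return chosen

# Toy example: y depends on x1 and x2 but not x3.
rng = np.random.default_rng(1)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2 * x1 - x2 + rng.normal(scale=0.5, size=n)
selected = forward_select({'x1': x1, 'x2': x2, 'x3': x3}, y, n)
```

Custom rules like the sign constraint from the question could be enforced inside the loop, e.g. by skipping any candidate whose fitted coefficient has the "wrong" sign.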

SAS Enterprise Guide, different treatments for missing variables

We are using the ESS data set but are unsure how to deal with missing values in SAS Enterprise Guide. Our dependent variable is "subjective wellbeing", and we aim to include a large number of control variables; hence, we have a data set containing a lot of missing values.
We do not want to use "list-wise deletion". Instead, we would like to treat the different missings differently depending on the respondent's response: "no answer", "not applicable", "refusal", "don't know". For example, we plan to conduct pair-wise deletion for "not applicable", while we might want to use e.g. the mean value for some other responses, depending on the question (under the assumption that the respondent's response provides information about MCAR, MAR, or NMAR).
Our main questions are:
Currently, our missing values are marked in different ways in the data set (99, 77, 999, 88, etc.). Should we replace these values in Excel before proceeding in SAS Enterprise Guide? If yes, how should we best replace them, given that they are supposed to be treated in different ways?
How do we tell SAS Enterprise Guide to treat different missings in different ways?
If we use dummy variables to mark refusals for e.g. income, how do we include these in the final regression?
We have tried to read about this but are a bit confused, so we would really appreciate any help :)
On a technical note, SAS offers special missing values: .a, .b, .c, etc. (not case sensitive).
Replace the numeric codes in SAS, e.g. 99 → .a, 77 → .b.
Decision trees, for example, will be able to handle these as separate values.
To keep the information of the missing observations in a regression model you will have to make some kind of tradeoff (find the least harmful solution to your problem).
One classical solution is to create dummy variables and replace the missing values with the mean. Include both the dummies and the original variables in the model. Possible problems: the coefficients will be biased, multicollinearity, too many categories/variables.
Another approach would be to bin your variables into categories. Do it just by value (e.g. deciles) and you may suffer information loss; do it by theory and you may suffer confirmation bias.
A more advanced approach would be to calculate the information value (http://support.sas.com/resources/papers/proceedings13/095-2013.pdf) of your independent variables, thereby replacing all values including the missings. Of course this will again lead to bias and loss of information, but it might at least be a good first step towards identifying useful/useless missing values.
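The recode-then-flag pattern (separate indicators per missing reason, then mean imputation) looks like this in Python/pandas. The sentinel codes 99 and 77 are taken from the question; the income column and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy survey column: 99 = refusal, 77 = don't know (codes from the question).
df = pd.DataFrame({'income': [52000, 99, 41000, 77, 60000, 99]})

# Flag each missing reason separately (the analogue of SAS's .a, .b).
df['income_refused'] = (df['income'] == 99).astype(int)
df['income_dontknow'] = (df['income'] == 77).astype(int)

# Recode sentinels to NaN, then mean-impute in the classical way.
df['income_clean'] = df['income'].replace({99: np.nan, 77: np.nan})
df['income_imputed'] = df['income_clean'].fillna(df['income_clean'].mean())
# A regression would then include income_imputed plus both dummy flags.
```

The caveats from the answer above still apply: the imputed coefficients are biased, and the dummies can introduce multicollinearity.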

Official Stata command for bivariate normal probability density function

I see that Stata has the binormal command for computing the bivariate cumulative distribution function, but no corresponding (official) command for computing the bivariate probability density function. (Please let me know if I am wrong.) I know that there is a user-written function bnormpdf for that, but unlike official commands such as normalden for the univariate probability density function, the variable to be generated appears on the right-hand side:
bnormpdf x1 x2, rho(.2) dens(pdf_b) double
I was wondering whether this pattern has any effects when programming, for example, maximum likelihood estimation (this may seem too broad).
It's not clear what you are worried about, but in Stata terms you have functions and commands the wrong way round here.
Commands and functions are totally disjoint in Stata.
A command may call a function and in Stata that is the only way to use a function. But a function may not call a command.
Users cannot write functions in Stata. (Users can write egen functions and Mata functions but neither of those categories is relevant here.) Only Stata's developers can write Stata functions.
Note that some (occasional) users of Stata prefer to ignore Stata's own terminology in discussing Stata, perhaps because they regard it as perverse. I don't recommend that. Stata's terminology choices are open to discussion, but you need to understand Stata's terminology before you can discuss it.
All that said, I don't think there is much to add on your question.
http://www.stata.com/manuals13/dfunctions.pdf documents binormal() which is, in Stata terms, a function, not a command.
bnormpdf (SSC) is in contrast not a function but a command.
However, nothing stops you using either or both within your own programs. The syntax is necessarily different, so you must use bnormpdf to create a new variable before you use that variable for your own purposes. You can't use a call to bnormpdf within some other command.
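For comparison, the density that bnormpdf computes is just the closed-form standard bivariate normal pdf, which is easy to reproduce elsewhere. A small Python sketch (rho = 0.2 as in the example above; the evaluation points are made up; scipy's multivariate_normal would give the same numbers):

```python
import numpy as np

def bnorm_pdf(x1, x2, rho):
    """Standard bivariate normal density with correlation rho."""
    z = (x1**2 - 2 * rho * x1 * x2 + x2**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

# Analogue of: bnormpdf x1 x2, rho(.2) dens(pdf_b)
pdf_b = bnorm_pdf(np.array([0.0, 1.0]), np.array([0.0, -1.0]), rho=0.2)
```

At (0, 0) this equals 1 / (2π√(1 − ρ²)), a handy check on the implementation.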

Two way clustering in ordered logit model, restricting rstudent to mitigate outlier effects

I have an ordered dependent variable (1 through 21) and continuous independent variables. I need to run an ordered logit model, clustering by firm and time, and eliminate outliers with studentized residuals < -2.5 or > 2.5. I only know the ologit command and some of its options; I have no idea how to do two-way clustering or how to eliminate outliers with studentized residuals:
ologit rating3 securitized retained, cluster(firm)
As far as I know, two-way clustering has only been extended to a few estimation commands (like ivreg2 from SSC, and tobit/logit/probit here). There's no automated way of eliminating outliers, but it can easily be done on your own.
Use the logit2.ado from the link Dimitriy gave (Mitchell Petersen's website) and modify it to use the ologit command. It's simple enough to do with a little trial and error. Good luck!
If you have a variable with 21 ordinal categories, I would have no problem treating it as continuous. If you want to back that up somehow, I wrote a paper on welfare measurement with ordinal variables; see DOI:10.1111/j.1475-4991.2008.00309.x. Then you can use ivreg2. You should be aware of all the issues involved with that estimator, in particular that it implicitly assumes the correlations are fully modeled by this two-way structure, so that observations for firms i and j at times t and s are uncorrelated for i != j and t != s. Sometimes this is a strong assumption to make: New York and New Jersey may be correlated in 2010, but New York in 2010 is assumed uncorrelated with New Jersey in 2009.
I have no idea what you might mean by ordinal outliers. Somebody must have piled up a bunch of dissertation advice (or worse, analysis requests) without really trying to make sense of every bit.
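The outlier-trimming step, at least, is mechanical in any language. A Python sketch using internally studentized residuals with the ±2.5 cutoff from the question (whether "rstudent" should mean internal or external studentization is a judgment call, and the toy data are invented):

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized residuals of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    s = np.sqrt(resid @ resid / (n - p))
    h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # leverages
    return resid / (s * np.sqrt(1 - h))

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0]) + rng.normal(size=n)
y[0] += 10                                 # plant one gross outlier

r = studentized_residuals(X, y)
keep = np.abs(r) <= 2.5                    # drop rows with |rstudent| > 2.5
X_trim, y_trim = X[keep], y[keep]
```

After trimming, the model (ologit, or a linear model if you follow the continuous-treatment suggestion above) would be refit on the kept rows.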