Official Stata command for bivariate normal probability density function - stata

I see that Stata has binormal command for computing bivariate cumulative distribution function but not corresponding (official) command for computing bivariate probability density function. (Please let me know if I am wrong). I know that there is a user-written function bnormpdf for that but unlike the official commands like normalden for univariate probability density function, the variable to be generated appears at the right hand side.
bnormpdf x1 x2, rho(.2) dens(pdf_b) double
I was wondering whether this pattern will have any effects when programming, for example maximum likelihood (this may seem too broad).

It's not clear what you are worried about, but in Stata terms you have functions and commands the wrong way round here.
Commands and functions are totally disjoint in Stata.
A command may call a function and in Stata that is the only way to use a function. But a function may not call a command.
Users cannot write functions in Stata. (Users can write egen functions and Mata functions but neither of those categories is relevant here.) Only Stata's developers can write Stata functions.
Note that some (occasional) users of Stata prefer to ignore Stata's own terminology in discussing Stata, perhaps because they regard it as perverse. I don't recommend that. Stata's terminology choices are open to discussion, but you need to understand Stata's terminology before you can discuss it.
All that said, I don't think there is much to add on your question.
http://www.stata.com/manuals13/dfunctions.pdf documents binormal() which is, in Stata terms, a function, not a command.
bnormpdf (SSC) is in contrast not a function but a command.
However, nothing stops you using either or both within your own programs. The syntax is necessarily different, so you must use bnormpdf to create a new variable before you use that variable for your own purposes. You can't use a call to bnormpdf within some other command.

Related

Exclude variables in regression based on causality-criteria in SAS

I have done my best to search the web for an answer to my question, but haven't been able to find any. Maybe I'm not asking in the right way, or maybe my problem can't be solved... Well, here goes nothing!
When running a regression in SAS, it is possible to do backward or forward selection and thereby eliminating all insignificant variables, which is great, but just because the p-value of variable is ≤ 0.05, that doesn't necessarily mean that the result is correct.
E.g., I run a regression in SAS with the dependent variable being numbers of deaths due to illness and the independent variable being number of doctors. The result is significant with p ≤ 0.05, but the coefficient says, that as the number of doctors rises, the number of deaths also goes up. This would probably be the result of a spurious regression, but the causality is wrong, but SAS is only a computer, and doesn't know which way, the causality would go. (Of course it could also be true, that more doctors=more deaths due to some other factor, but let us ignore that for now).
My question is: Is it possible to make a regression and then tell SAS, that it must do backward/forward elimination, but according to some rules I set, it also has to exclude variables that don't meet these rules? E.g. if deaths goes up, as the number of doctors increase, exclude the variable number of doctors? And what would that
I really hope, that someone can help me, because I am running a regression for many different years with more than 50 variables, and it would be great if I didn't have to go through all results myself.
Thanks :)
I don't think this is possible or recommended. As mentioned, SAS is a computer and can't know which regression results are spurious. What if more doctors = more medical procedures = more death? Obviously you need to apply expert opinion to each situation but the above scenario is just as plausible.
You also mention 'share of docs' which isn't the actual number if I'm correct? So it may also be an artifact of how this metric is calculated.
If you have a specific set of rules you want to exclude that may be possible. But you have to first define all those rules and have a level of certainty regarding them.
If you need to specify unusual parameter selection criteria you could always roll your own machine learning by brute force: partition the data, run different regression models over all the partitions in macro loops, and use something like AIC to select the best model.
However, unless you are a machine learning expert it's probably best to start with something like proc glmselect.
SAS can do both forward selection and backwards elimination in the glmselect procedure, e.g.:
proc gmlselect data=...;
model ... / select=forward;
...
It would also be possible to combine both approaches - i.e. run several iterations of proc glmselect in macro loops, each with different model specifications, and then choose the best result.

What is a proper method for minimizing st deviation of dependent variable (e.g. clustering?)

I'm stuck with minimizing the st deviation of a dependent variable being time difference in days. The mean is OK, but the deviation is terrible. Tried clustering by independent variables and noticed quite dissimilar clusters. Now, wondering:
1) How I can actually apply this knowledge from clustering to the independent variable? The fact is that it was not included in initial clustering analysis, as I know it's dependent on the others.
2) Given that I know the variable of time difference is dependent, should I run clustering of it with the variable of cluster number being the result of my initial clustering analysis? Would it help?
3) Is there any other technique apart from clustering that can help me somehow categorize observation groups so that for each group I would have a separate mean of the independent variable with low st deviation?
Any help highly appreciated!
P.S. I was using Stata and SPSS, though I can also use SAS if you can share the code.
It sounds like you're going about this all wrong. Here are some relevant points to consider.
It's more important for the variance to be consistent across groups than it is to be low.
Clustering is (generally) going to organize individuals based on similar patterns of the clustering variables.
Fewer observations will generally not decrease the size of your standard deviation.
Any time you take continuous variables (either IV or DVs) and convert them into categorical variables, you are removing variance from the equation, and including more measurement error. Sometimes there are good reasons to do this, often times there are not.
Analysis should be theory-driven whenever possible, as data driven analysis (like what you're trying to accomplish here) is more likely to produce results that can't be reproduced or generalized to other data sets, samples, or populations.

SAS Enterprise Guide, different treatments for missing variables

We are using the ESS data set, but are unsure how to deal with the issue of missing values in SAS Enterprise Guide. Our dependent variable is "subjective wellbeing", and aim to include a large amount of control variables - hence, we have a situation where we have a data set containing a lot of missing values.
We do not want to use "list-wise deletion". Instead, we would like to treat the different missings in different manners depending on the respondent's response: "no answer", "Not applicable", "refusal", "don't know". For example, we plan to conduct pair-wise deletion of non-applicable, while we might want to use e.g. the mean value for some other responses - depending on the question (under the assumption that the respondent's response provide information about MCAR, MAR, NMAR).
Our main questions are:
Currently, our missing variables are marked in different ways in the data set (99, 77, 999, 88 etc.), should we replace these values in Excel before proceeding in SAS Enterprise Guide? If yes - how should we best replace them as they are supposed to be treated in different ways?
How do we tell SAS Enterprise Guide to treat different missings in different ways?
If we use dummy variables to mark refusals for e.g. income, how do we include these in the final regression?
We have tried to read about this but are a bit confused, so we would really appreciate any help :)
On a technical note, SAS offers special missing values: .a .b .c etc. (not case sensitive).
Replace the number values in SAS e.g. 99 =.a 77 = .b
Decisions Trees for example will be able to handle these as separate values.
To keep the information of the missing observations in a regression model you will have to make some kind of tradeoff (find the least harmful solution to your problem).
One classical solution is to create dummy variables and replace the
missing values with the mean. Include both the dummies and the
original variables in the model. Possible problems: The coefficients
will be biased, multicollinearity, too many categories/variables.
Another approaches would be to BIN your variables into categories. Do
it just by value (e.g. deciles) and you may suffer information loss. Do it by theory and
you may suffer confirmation bias.
A more advanced approach would be to calculate the information
value
(http://support.sas.com/resources/papers/proceedings13/095-2013.pdf)
of your independent variables. Thereby replacing all values including
the missings. Of cause this will again lead to bias and loss of
information. But might at least be a good step to identify
useful/useless missing values.

Stata - Tobit - Lagrange Multiplier Test

Community,
I am running a left- and right-censored tobit regression model. The dependent variable is the proportion of cash used in M&A transactions running from 0 to 1.
I assume heteroskedasticity to be prevalent due to the characteristics of my cross-sectional sample as well as the BPCW test for the LS regression model. In order to test the tobit specifications, I used bctobit. However, bctobit is not applicable for right-censored data.
This gives rise to the following question:
- Is there another user-written command to test for the tobit specifications with right- and left-censored data?
Thanks a lot for your efforts!
The most immediate question to me here is statistical. From what you say this approach is inadvisable, so how to implement it is immaterial.
I don't think Tobit makes much sense for variables that are defined to lie in an interval. Censoring to me implies that some high or low values might have been observed in principle, but in practice are recorded as less extreme values. It seems to me that logit or probit are the appropriate link functions for proportional responses, and in that Stata that means glm with e.g. logit link.
Regardless of that, do you regard linear dependence as expected here?
For an excellent concise review making this point, see http://www.stata-journal.com/sjpdf.html?articlenum=st0147

Two way clustering in ordered logit model, restricting rstudent to mitigate outlier effects

I have an ordered dependent variable (1 through 21) and continuous independent variables. I need to run the ordered logit model, clustering by firm and time, eliminating outliers with Studentized Residuals <-2.5 or > 2.5. I just know ologit command and some options for the command; however, I have no idea about how to do two way clustering and eliminate outliers with studentized residuals:
ologit rating3 securitized retained, cluster(firm)
As far as I know, two way clustering has only been extended to a few estimation commands (like ivreg2 from scc and tobit/logit/probit here). Eliminating outliers can easily be done on your own and there's no automated way of doing it.
Use the logit2.ado from the link Dimitriy gave (Mitchell Petersen's website) and modify it to use the ologit command. It's simple enough to do with a little trial and error. Good luck!
If you have a variable with 21 ordinal categories, I would have no problems treating that as a continuous one. If you want to back that up somehow, I wrote a paper on welfare measurement with ordinal variables, see DOI:10.1111/j.1475-4991.2008.00309.x. Then you can use ivreg2. You should be aware of all the issues involved with that estimator, in particular, that it implicitly assumed that the correlations are fully modeled by this two-way structure, and observations for firms i and j and times t and s are definitely uncorrelated for i!=j and t!=s. Sometimes, this is a strong assumption to make -- i.e., New York and New Jersey may be correlated in 2010, but New York 2010 is uncorrelated with New Jersey 2009.
I have no idea of what you might mean by ordinal outliers. Somebody must have piled a bunch of dissertation advice (or worse analysis requests) without really trying to make sense of every bit.