Exclude variables in regression based on causality-criteria in SAS - sas

I have done my best to search the web for an answer to my question, but haven't been able to find any. Maybe I'm not asking in the right way, or maybe my problem can't be solved... Well, here goes nothing!
When running a regression in SAS, it is possible to do backward or forward selection and thereby eliminate all insignificant variables, which is great. But just because the p-value of a variable is ≤ 0.05, that doesn't necessarily mean the result is correct.
E.g., I run a regression in SAS with the dependent variable being the number of deaths due to illness and the independent variable being the number of doctors. The result is significant with p ≤ 0.05, but the coefficient says that as the number of doctors rises, the number of deaths also goes up. This is probably a spurious regression with the causality running the wrong way, but SAS is only a computer and doesn't know which way the causality should go. (Of course it could also be true that more doctors = more deaths due to some other factor, but let us ignore that for now.)
My question is: Is it possible to run a regression and tell SAS to do backward/forward elimination, but also to exclude variables that don't meet some rules I set? E.g. if deaths go up as the number of doctors increases, exclude the variable number of doctors? And what would that
I really hope that someone can help me, because I am running a regression for many different years with more than 50 variables, and it would be great if I didn't have to go through all the results myself.
Thanks :)

I don't think this is possible or recommended. As mentioned, SAS is a computer and can't know which regression results are spurious. What if more doctors = more medical procedures = more deaths? Obviously you need to apply expert opinion to each situation, but the above scenario is just as plausible.
You also mention 'share of docs' which isn't the actual number if I'm correct? So it may also be an artifact of how this metric is calculated.
If you have a specific set of rules for excluding variables, that may be possible. But you first have to define all those rules and have a level of certainty regarding them.
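To make "define all those rules" concrete, here is a minimal sketch in Python (not SAS) of checking fitted coefficients against expert-supplied sign rules. The variable names, coefficients, p-values and expected signs are all made up for illustration; in practice they would come from the regression output and from subject-matter knowledge.

```python
# Hypothetical fitted results: variable -> (coefficient, p-value).
results = {
    "doctors": (0.8, 0.01),
    "vaccination_rate": (-1.2, 0.03),
    "hospital_beds": (-0.1, 0.40),
}

# Hypothetical expert rules: the sign each coefficient is expected to have.
expected_sign = {"doctors": -1, "vaccination_rate": -1, "hospital_beds": -1}


def violates_rules(name, coef, p_value, alpha=0.05):
    """Return a reason string if the variable should be dropped, else None."""
    if p_value > alpha:
        return "not significant"
    sign = expected_sign.get(name)
    if sign is not None and coef * sign < 0:
        return "unexpected sign"
    return None


dropped = {}
for name, (coef, p) in results.items():
    reason = violates_rules(name, coef, p)
    if reason:
        dropped[name] = reason

kept = [name for name in results if name not in dropped]
print("kept:", kept)
print("dropped:", dropped)
```

Note that dropping a variable changes the other coefficients, so the model would have to be refitted and the rules checked again after each exclusion.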

If you need to specify unusual parameter selection criteria you could always roll your own machine learning by brute force: partition the data, run different regression models over all the partitions in macro loops, and use something like AIC to select the best model.
However, unless you are a machine learning expert it's probably best to start with something like proc glmselect.
SAS can do both forward selection and backwards elimination in the glmselect procedure, e.g.:
proc glmselect data=...;
model ... / selection=forward;
...
It would also be possible to combine both approaches - i.e. run several iterations of proc glmselect in macro loops, each with different model specifications, and then choose the best result.
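The "choose the best result" step can be sketched as follows, in Python rather than SAS macro code. This assumes each candidate model has already been fitted by least squares and that its residual sum of squares and parameter count are available; the model labels and numbers are invented for illustration.

```python
import math


def aic(n_obs, rss, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


# Hypothetical candidates: label -> (residual sum of squares, number of parameters).
candidates = {
    "doctors": (120.0, 2),
    "doctors + beds": (100.0, 3),
    "doctors + beds + income": (99.5, 4),
}
n_obs = 50  # all candidates must be fitted on the same observations

scores = {label: aic(n_obs, rss, k) for label, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)  # lower AIC is better
print(best, round(scores[best], 2))
```

Here the third model fits slightly better than the second, but not by enough to pay for its extra parameter, so the two-variable model is selected.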

Related

Recode or case_when issue in combining variables using or

I am trying to use recode or case_when to create a subset of my dataset that includes variables that indicate a risk factor for HIV. Essentially, I have four variables that indicate HIV risk and I want to combine them so that I can say if this factor is present OR this factor is present OR this factor (to include the 4 factors).
Then I want to combine those to eliminate the people in my data set who do not have at least 1 of the 4 risk factors.
Can anyone help with the coding needed to do this?
Thanks so much, this is my first time typing a question on this site.
I have tried various versions of recode and case_when and ifelse all of which have not worked. I used similar code to determine who is HIV negative in my data set but that was based on Yes or No, and the risk factors follow an OR logic. Meaning I want to include all of the people who have at least 1 of the risk factors.

Is there a way to quantify impact of independent variables with gradient boosting?

I've been asked to run a model using gradient boosting or random forest. So far so good, however, the only output that comes back in terms of variable importance is based on the number of times a variable was used as a branch rule. I've now been asked to basically get coefficients or somehow quantify the impact that the variables have on the target.
Is there a way to do this with a gradient boosting model? My other thought was to take only the variables that were shown to be used as branch rules and use them in a regular decision tree, a GLM or a regular regression model.
Any help or ideas would be appreciated!! Thanks so much!
Just to make certain there is not a misunderstanding: the SAS implementation of decision trees/gradient boosting (at least in EM) uses split-based variable importance.
Split-based importance does NOT count the number of splits made.
It is the ratio of the reduction in sum-of-squares achieved by one variable (specifically, the sum over all splits on this variable) to the reduction in sum-of-squares achieved by all splits in the model.
If you are using surrogate rules, highly correlated variables will receive roughly the same value.
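As a rough illustration of the ratio described above, here is a small Python sketch. It follows the definition as worded in this answer only; the exact computation inside SAS may differ in detail, and the split list is invented.

```python
from collections import defaultdict

# Hypothetical splits: (variable, reduction in sum-of-squares from that split).
splits = [("age", 40.0), ("income", 25.0), ("age", 10.0), ("region", 25.0)]

# Sum the reduction over all splits made on each variable.
reduction = defaultdict(float)
for variable, gain in splits:
    reduction[variable] += gain

# Divide by the reduction achieved by all splits in the model.
total = sum(reduction.values())
importance = {variable: r / total for variable, r in reduction.items()}
print(importance)
```

In this example "income" and "region" are each used once and "age" twice, yet the importance is driven by the size of the reductions, not by the number of splits.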

SAS Enterprise Guide, different treatments for missing variables

We are using the ESS data set, but are unsure how to deal with the issue of missing values in SAS Enterprise Guide. Our dependent variable is "subjective wellbeing", and we aim to include a large number of control variables. Hence, we have a data set containing a lot of missing values.
We do not want to use "list-wise deletion". Instead, we would like to treat the different kinds of missing values differently depending on the respondent's response: "no answer", "not applicable", "refusal", "don't know". For example, we plan to conduct pair-wise deletion for "not applicable", while we might use e.g. the mean value for some other responses, depending on the question (under the assumption that the respondent's response provides information about whether the data are MCAR, MAR or NMAR).
Our main questions are:
Currently, our missing values are marked in different ways in the data set (99, 77, 999, 88 etc.). Should we replace these values in Excel before proceeding in SAS Enterprise Guide? If yes, how should we best replace them, given that they are supposed to be treated in different ways?
How do we tell SAS Enterprise Guide to treat different missings in different ways?
If we use dummy variables to mark refusals for e.g. income, how do we include these in the final regression?
We have tried to read about this but are a bit confused, so we would really appreciate any help :)
On a technical note, SAS offers special missing values: .a, .b, .c etc. (not case sensitive).
Replace the numeric codes in SAS, e.g. 99 = .a, 77 = .b.
Decision trees, for example, will be able to handle these as separate values.
To keep the information of the missing observations in a regression model you will have to make some kind of tradeoff (find the least harmful solution to your problem).
One classical solution is to create dummy variables and replace the missing values with the mean. Include both the dummies and the original variables in the model. Possible problems: the coefficients will be biased, multicollinearity, too many categories/variables.
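A minimal sketch of the dummy-plus-mean approach, in Python rather than SAS. The codes 77/88/99, their labels and the income values are assumptions for illustration; the real codes depend on the ESS codebook for each question.

```python
from statistics import mean

# Hypothetical missing codes and labels for one variable.
MISSING_CODES = {77: "refusal", 88: "dont_know", 99: "no_answer"}

income = [30, 99, 50, 77, 40]  # made-up raw values including missing codes

# Mean of the genuinely observed values only.
observed = [v for v in income if v not in MISSING_CODES]
fill = mean(observed)

# Original variable with the missing codes replaced by the mean.
income_filled = [fill if v in MISSING_CODES else v for v in income]

# One 0/1 dummy per kind of missing value.
dummies = {
    label: [1 if v == code else 0 for v in income]
    for code, label in MISSING_CODES.items()
}
print(income_filled)
print(dummies)
```

Both `income_filled` and the dummies would then enter the regression. A dummy that is zero for every observation (here `dont_know`) carries no information and should be left out.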
Another approach would be to BIN your variables into categories. Do it just by value (e.g. deciles) and you may suffer information loss. Do it by theory and you may suffer confirmation bias.
A more advanced approach would be to calculate the information value (http://support.sas.com/resources/papers/proceedings13/095-2013.pdf) of your independent variables, thereby replacing all values including the missings. Of course this will again lead to bias and loss of information, but it might at least be a good step towards identifying useful/useless missing values.
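For readers unfamiliar with the information value, here is a sketch of one common formulation for a binary target, in Python. It uses the weight of evidence (WOE) per category, which is the value that would replace the original category, missing included. The counts are invented, and sign and naming conventions vary between sources, so treat this as an assumption rather than the exact method of the linked paper.

```python
import math

# Hypothetical counts per category of one predictor: (non-events, events).
# "missing" is treated as a category of its own.
counts = {"low": (40, 10), "high": (30, 30), "missing": (30, 10)}

total_non = sum(non for non, evt in counts.values())
total_evt = sum(evt for non, evt in counts.values())

woe = {}
iv = 0.0
for category, (non, evt) in counts.items():
    p_non = non / total_non  # share of all non-events in this category
    p_evt = evt / total_evt  # share of all events in this category
    # A category with a zero count would need smoothing before taking the log.
    woe[category] = math.log(p_non / p_evt)
    iv += (p_non - p_evt) * woe[category]

print(woe)
print(round(iv, 4))
```

A "missing" category whose WOE is far from zero suggests that missingness itself carries information about the target.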

Two way clustering in ordered logit model, restricting rstudent to mitigate outlier effects

I have an ordered dependent variable (1 through 21) and continuous independent variables. I need to run an ordered logit model, clustering by firm and time, and eliminating outliers with studentized residuals < -2.5 or > 2.5. I only know the ologit command and some of its options; I have no idea how to do two-way clustering or how to eliminate outliers using studentized residuals:
ologit rating3 securitized retained, cluster(firm)
As far as I know, two-way clustering has only been extended to a few estimation commands (like ivreg2 from SSC and tobit/logit/probit here). Eliminating outliers is easy to do on your own, but there's no automated way of doing it.
Use the logit2.ado from the link Dimitriy gave (Mitchell Petersen's website) and modify it to use the ologit command. It's simple enough to do with a little trial and error. Good luck!
If you have a variable with 21 ordinal categories, I would have no problems treating that as a continuous one. If you want to back that up somehow, I wrote a paper on welfare measurement with ordinal variables, see DOI:10.1111/j.1475-4991.2008.00309.x. Then you can use ivreg2. You should be aware of all the issues involved with that estimator, in particular that it implicitly assumes that the correlations are fully modeled by this two-way structure, and that observations for firms i and j and times t and s are definitely uncorrelated for i!=j and t!=s. Sometimes, this is a strong assumption to make -- i.e., New York and New Jersey may be correlated in 2010, but New York 2010 is uncorrelated with New Jersey 2009.
I have no idea of what you might mean by ordinal outliers. Somebody must have piled a bunch of dissertation advice (or worse analysis requests) without really trying to make sense of every bit.

appropriate minimum support for itemset?

Please suggest any material about appropriate minimum support and confidence for itemsets!
I use the Apriori algorithm to search for frequent itemsets. I still don't know what support and confidence are appropriate for an itemset. I would like to know what considerations determine how large the support should be.
The answer is that the appropriate values depend on the data.
For some datasets, the best value may be 0.5. But for some other datasets it may be 0.05. It depends on the data.
But if you set minsup = 0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there are too many patterns.
From my experience, the best way to choose minsup and minconf is to start with high values and then lower them gradually until you find enough patterns.
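The "start high and lower gradually" strategy can be sketched in Python as follows. The miner below is a brute-force counter limited to itemsets of size one and two, used as a stand-in for Apriori to keep the example short; the transactions, the target of three itemsets and the step size are all made up.

```python
from itertools import combinations

# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def frequent_itemsets(transactions, minsup, max_size=2):
    """Brute-force search for itemsets up to max_size with support >= minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    found = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= minsup:
                found[candidate] = support
    return found


def lower_until_enough(transactions, wanted, start=1.0, step=0.25):
    """Lower minsup from `start` in steps until at least `wanted` itemsets appear."""
    minsup = start
    while minsup > 0:
        found = frequent_itemsets(transactions, minsup)
        if len(found) >= wanted:
            return minsup, found
        minsup -= step
    return None, {}


minsup, found = lower_until_enough(transactions, wanted=3)
print(minsup, found)
```

On this toy data no itemset has support 1.0, two items reach 0.75, and five itemsets reach 0.5, so the search stops at minsup = 0.5.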
Alternatively, if you don't want to have to set minsup, you can use a top-k algorithm where, instead of specifying minsup, you specify for example that you want the k most frequent rules. For example, k = 1000 rules.
If you are interested in top-k association rule mining, you can check my Java code here:
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you should know that there are many other interestingness measures besides support and confidence: lift, all-confidence, ... To learn more about this, you can read these articles: "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules". Basically, all measures have problems in some cases... no measure is perfect.
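To show how lift differs from confidence, here is a small Python sketch using the standard definitions (confidence = support(A ∪ C) / support(A), lift = confidence / support(C)). The transactions and the rule are invented for illustration.

```python
# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    sup_both = support(antecedent | consequent, transactions)
    confidence = sup_both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return sup_both, confidence, lift


sup, conf, lift = rule_measures({"bread"}, {"milk"}, transactions)
print(sup, conf, lift)
```

The rule bread -> milk has a confidence of about 0.67, which looks reasonable on its own, but its lift is below 1 because milk already appears in 75% of all transactions.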
Hope this helps!
In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values they want to provide. Depending on your dataset and your objectives you decide the minSup and minConf.
Obviously, if you set these values lower, then your algorithm will take longer to execute and you will get a lot of results.
The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose the parameters appropriately. In theory you can set them to 0. The algorithm will run, but it will take a long time, and the result will not be particularly useful, as it contains just about anything.
So choose them so that the results suit your needs. Mathematically, any value is "correct".