SAS Enterprise Guide, different treatments for missing values - sas

We are using the ESS data set, but are unsure how to deal with missing values in SAS Enterprise Guide. Our dependent variable is "subjective wellbeing", and we aim to include a large number of control variables - hence we have a data set containing a lot of missing values.
We do not want to use list-wise deletion. Instead, we would like to treat the different kinds of missingness differently depending on the respondent's response: "no answer", "not applicable", "refusal", "don't know". For example, we plan to conduct pair-wise deletion for "not applicable", while we might want to use e.g. the mean value for some other responses - depending on the question (under the assumption that the respondent's response provides information about MCAR, MAR, NMAR).
Our main questions are:
Currently, our missing values are coded in different ways in the data set (99, 77, 999, 88 etc.). Should we replace these values in Excel before proceeding in SAS Enterprise Guide? If yes, how should we best replace them, given that they are supposed to be treated in different ways?
How do we tell SAS Enterprise Guide to treat different missings in different ways?
If we use dummy variables to mark refusals for e.g. income, how do we include these in the final regression?
We have tried to read about this but are a bit confused, so we would really appreciate any help :)

On a technical note, SAS offers special missing values: .a .b .c etc. (not case sensitive).
Replace the numeric codes in SAS, e.g. 99 = .a, 77 = .b.
Decision trees, for example, will be able to handle these as separate values.
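For example, here is a minimal DATA step sketch of that recoding; the variable names (income, trust) and the mapping of codes to categories are assumptions, so map each variable's codes according to the ESS codebook:

/* Hedged sketch: recode numeric missing codes into SAS special missing
   values so each type of missingness stays distinguishable.
   Variable names and code meanings are assumptions - check the codebook. */
data ess_recode;
   set ess;                             /* your imported ESS table */
   array v {*} income trust;            /* variables to recode; extend as needed */
   do k = 1 to dim(v);
      if v{k} in (77, 777) then v{k} = .a;        /* refusal        */
      else if v{k} in (88, 888) then v{k} = .b;   /* don't know     */
      else if v{k} in (99, 999) then v{k} = .c;   /* no answer      */
      else if v{k} in (66, 666) then v{k} = .d;   /* not applicable */
   end;
   drop k;
run;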
To keep the information contained in the missing observations in a regression model, you will have to make some kind of tradeoff (find the least harmful solution for your problem).
One classical solution is to create dummy variables and replace the missing values with the mean. Include both the dummies and the original variables in the model. Possible problems: biased coefficients, multicollinearity, too many categories/variables.
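A rough sketch of that dummy-plus-mean approach, assuming the recoded data set from above and made-up variable names (income, wellbeing); PROC STDIZE with METHOD=MEAN REPONLY does the mean replacement:

/* Hedged sketch: flag the missings, mean-impute, then put both the
   dummy and the imputed variable in the regression.
   All data set and variable names are assumptions. */
data ess_flag;
   set ess_recode;
   miss_income = missing(income);      /* 1 if income was any kind of missing */
run;

proc stdize data=ess_flag out=ess_imputed method=mean reponly;
   var income;                         /* replace missing income with its mean */
run;

proc reg data=ess_imputed;
   model wellbeing = income miss_income;   /* dummy and variable together */
run;
quit;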
Another approach would be to BIN your variables into categories. Do it just by value (e.g. deciles) and you may suffer information loss. Do it by theory and you may suffer confirmation bias.
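A quick sketch of value-based binning into deciles with PROC RANK (variable names are again assumptions; missings keep a missing rank and can then be given their own category):

/* Hedged sketch: decile-bin a continuous predictor. GROUPS=10 assigns
   bins 0-9; missings stay missing and are then recoded into their own bin. */
proc rank data=ess_recode out=ess_binned groups=10;
   var income;
   ranks income_bin;
run;

data ess_binned;
   set ess_binned;
   if missing(income) then income_bin = 99;   /* give missings their own bin */
run;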
A more advanced approach would be to calculate the information value (http://support.sas.com/resources/papers/proceedings13/095-2013.pdf) of your independent variables, thereby replacing all values, including the missings. Of course this will again lead to bias and loss of information, but it might at least be a good step towards identifying useful/useless missing values.
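Information value needs a binary target, so this is only a rough sketch assuming a hypothetical dichotomized outcome high_swb (1 = high wellbeing) and the binned predictor from above; the linked paper describes the full procedure:

/* Hedged sketch: weight of evidence and information value for one binned
   predictor. high_swb is a hypothetical 0/1 recode of the outcome;
   missings sit in their own bin, so they get their own WOE. */
proc sql noprint;
   select sum(high_swb = 1), sum(high_swb = 0)
      into :tot_event, :tot_nonevent
   from ess_binned;
quit;

proc sql;
   create table woe_income as
   select income_bin,
          sum(high_swb = 1) / &tot_event.    as dist_event,
          sum(high_swb = 0) / &tot_nonevent. as dist_nonevent,
          log(calculated dist_event / calculated dist_nonevent) as woe,
          (calculated dist_event - calculated dist_nonevent)
             * calculated woe                as iv_contribution
   from ess_binned
   group by income_bin;

   select sum(iv_contribution) as information_value from woe_income;
quit;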

Related

Recode or case_when issue in combining variables using or

I am trying to use recode or case_when to create a subset of my dataset that includes the variables indicating a risk factor for HIV. Essentially, I have four variables that indicate HIV risk, and I want to combine them so that I can say "this factor is present OR this factor is present OR this factor is present" (covering all 4 factors).
Then I want to combine those to eliminate the people in my data set who do not have at least 1 of the 4 risk factors.
Can anyone help with the coding needed to do this?
Thanks so much, this is my first time typing a question on this site.
I have tried various versions of recode, case_when, and ifelse, none of which have worked. I used similar code to determine who is HIV negative in my data set, but that was based on Yes or No, whereas the risk factors follow OR logic, meaning I want to include all of the people who have at least 1 of the risk factors.

Best way to feature select using PCA (discussion)

Terminology:
Component: PC
loading-score[i,j]: the loading of feature j in PC[i]
Question:
I know that questions regarding feature selection have been asked several times here at StackOverflow (SO) and on other tech pages, with different answers/discussions proposed. That is why I want to open a discussion of the different solutions, and not post it as a general question, since that has been done.
Different methods are proposed for feature selection using PCA: for instance, using the dot product between the original features and the components (here) to get their correlation; a discussion at SO here suggests that you can only talk about important features as loading scores within a component (and not use that importance in the input space); and another discussion at SO (which I cannot find at the moment) suggests that the importance of feature[j] would be sum(abs(loading_score[:,j])), i.e. the sum of the absolute values of loading_score[i,j] over all components i.
I personally would think that a way to get the importance of a feature would be an absolute sum where each loading_score[i,j] is weighted by the explained variance of component i, i.e.
imp_feature[j] = sum_i( abs(loading_score[i,j]) * explained_variance[i] )
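As a rough illustration, that score could be computed along these lines (a sketch in SAS/IML; the data set name feats is an assumption, and nothing here endorses the score itself):

/* Hedged sketch of the proposed score: sum over components of
   |loading| weighted by that component's explained variance. */
proc iml;
   use feats;
   read all var _NUM_ into X[colname=vars];
   close feats;

   R = corr(X);                          /* correlation matrix of the features */
   call eigen(eval, evec, R);            /* eigenvalues and loadings            */
   explained_variance = eval / sum(eval);

   /* imp_feature[j] = sum over components i of |loading[j,i]| * explained_variance[i] */
   imp_feature = abs(evec) * explained_variance;
   print imp_feature[rowname=vars];
quit;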
Well, there is no universal way to select features; it totally depends on the dataset and some insights available about the dataset. I will provide some examples which might be helpful.
Since you asked about PCA: it extracts components one at a time, each chosen to explain as much of the remaining variance as possible. ICA (Independent Component Analysis), on the other hand, is able to extract multiple independent components simultaneously. Look at this example:
In this example, we mix three independent signals and try to separate them out using ICA and PCA. In this case, ICA does a better job than PCA. In general, if you search for Blind Source Separation (BSS) you will find more information about this. Besides, in this example we know the independent components, so the separation is easy. In general, we do not know the number of components, so you may have to guess based on some prior information about the dataset. You may also use LDA (Linear Discriminant Analysis) to reduce the number of features.
Once you have extracted the principal components using any of these techniques, you can visualize them in the following way, treating the extracted components as random variables, i.e., x, y, z.
For more information you may refer to the original source from which I took the two figures.
Coming back to your proposition,
imp_feature[j] = sum_i( abs(loading_score[i,j]) * explained_variance[i] )
I would not recommend this way due to the following reasons:
Taking abs(loading_score[i,j]) means that you may lose the positive or negative correlations of the considered features. explained_variance[i] may be used to find the correlation between features, but multiplying by it does not make any sense.
Edit:
In PCA, each component has its explained variance. Explained variance is the ratio between the individual component's variance and the total variance (the sum of all individual component variances), i.e. explained_variance[i] = eigenvalue[i] / sum_j(eigenvalue[j]). Feature significance can be measured by the magnitude of explained variance.
All in all, what I want to say is that feature selection totally depends on the dataset and the significance of the features. PCA is just one technique. First understand the properties of the features and the dataset, then try to extract features. Hope this helps. If you can provide us with an exact example, we may be able to provide more insights.

Exclude variables in regression based on causality-criteria in SAS

I have done my best to search the web for an answer to my question, but haven't been able to find any. Maybe I'm not asking in the right way, or maybe my problem can't be solved... Well, here goes nothing!
When running a regression in SAS, it is possible to do backward or forward selection and thereby eliminate all insignificant variables, which is great; but just because the p-value of a variable is ≤ 0.05, that doesn't necessarily mean that the result is correct.
E.g., I run a regression in SAS with the dependent variable being the number of deaths due to illness and the independent variable being the number of doctors. The result is significant with p ≤ 0.05, but the coefficient says that as the number of doctors rises, the number of deaths also goes up. This would probably be the result of a spurious regression where the causality runs the wrong way, but SAS is only a computer and doesn't know which way the causality would go. (Of course it could also be true that more doctors = more deaths due to some other factor, but let us ignore that for now.)
My question is: Is it possible to run a regression and tell SAS that it must do backward/forward elimination, but that, according to some rules I set, it also has to exclude variables that don't meet these rules? E.g. if deaths go up as the number of doctors increases, exclude the variable "number of doctors"? And what would that look like?
I really hope, that someone can help me, because I am running a regression for many different years with more than 50 variables, and it would be great if I didn't have to go through all results myself.
Thanks :)
I don't think this is possible or recommended. As mentioned, SAS is a computer and can't know which regression results are spurious. What if more doctors = more medical procedures = more death? Obviously you need to apply expert opinion to each situation but the above scenario is just as plausible.
You also mention 'share of docs' which isn't the actual number if I'm correct? So it may also be an artifact of how this metric is calculated.
If you have a specific set of rules by which you want to exclude variables, that may be possible. But you have to first define all those rules and have a level of certainty regarding them.
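If the rules can be written as sign expectations, one possible (hedged) sketch is to capture the estimates with ODS, drop the offending variables, and refit; all data set and variable names below are invented for illustration:

/* Hedged sketch: fit once, drop predictors whose coefficient violates a
   user-defined sign rule, refit. The rule here ("doctors should not
   increase deaths") is just the example from the question. */
ods output ParameterEstimates=pe;
proc reg data=health;
   model deaths = doctors nurses hospitals income;
run;
quit;

proc sql noprint;
   select Variable into :keepvars separated by ' '
   from pe
   where upcase(Variable) ne 'INTERCEPT'
     and not (upcase(Variable) = 'DOCTORS' and Estimate > 0);   /* the sign rule */
quit;

proc reg data=health;
   model deaths = &keepvars.;
run;
quit;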
If you need to specify unusual parameter selection criteria you could always roll your own machine learning by brute force: partition the data, run different regression models over all the partitions in macro loops, and use something like AIC to select the best model.
However, unless you are a machine learning expert it's probably best to start with something like proc glmselect.
SAS can do both forward selection and backwards elimination in the glmselect procedure, e.g.:
proc glmselect data=...;
   model ... / selection=forward;
   ...
run;
It would also be possible to combine both approaches - i.e. run several iterations of proc glmselect in macro loops, each with different model specifications, and then choose the best result.
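For instance, a rough sketch of that macro-loop idea; the data set ess, the response wellbeing, and the &candidates. list are placeholders, not part of the original question:

/* Hedged sketch: try several selection methods in a loop and compare the
   fit statistics GLMSELECT reports. */
%macro try_selection(method);
   title "Selection method: &method.";
   proc glmselect data=ess;
      model wellbeing = &candidates. / selection=&method.;
   run;
%mend try_selection;

%let candidates = income age educ health_status;   /* assumed predictor list */
%try_selection(forward);
%try_selection(backward);
%try_selection(stepwise);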

What is a proper method for minimizing the standard deviation of a dependent variable (e.g. clustering?)

I'm stuck with minimizing the standard deviation of a dependent variable, which is a time difference in days. The mean is OK, but the deviation is terrible. I tried clustering by independent variables and noticed quite dissimilar clusters. Now, I'm wondering:
1) How can I actually apply this knowledge from clustering to the dependent variable? The fact is that it was not included in the initial clustering analysis, as I know it's dependent on the others.
2) Given that I know the time-difference variable is dependent, should I run a clustering of it together with a cluster-number variable taken from my initial clustering analysis? Would it help?
3) Is there any other technique apart from clustering that can help me somehow categorize observation groups so that for each group I would have a separate mean of the dependent variable with a low standard deviation?
Any help highly appreciated!
P.S. I was using Stata and SPSS, though I can also use SAS if you can share the code.
It sounds like you're going about this all wrong. Here are some relevant points to consider.
It's more important for the variance to be consistent across groups than it is to be low.
Clustering is (generally) going to organize individuals based on similar patterns of the clustering variables.
Fewer observations will generally not decrease the size of your standard deviation.
Any time you take continuous variables (either IV or DVs) and convert them into categorical variables, you are removing variance from the equation, and including more measurement error. Sometimes there are good reasons to do this, often times there are not.
Analysis should be theory-driven whenever possible, as data driven analysis (like what you're trying to accomplish here) is more likely to produce results that can't be reproduced or generalized to other data sets, samples, or populations.

Two way clustering in ordered logit model, restricting rstudent to mitigate outlier effects

I have an ordered dependent variable (1 through 21) and continuous independent variables. I need to run an ordered logit model, clustering by firm and time, and eliminate outliers with Studentized Residuals < -2.5 or > 2.5. I only know the ologit command and some of its options; however, I have no idea how to do two-way clustering and eliminate outliers based on studentized residuals:
ologit rating3 securitized retained, cluster(firm)
As far as I know, two-way clustering has only been extended to a few estimation commands (like ivreg2 from SSC, and tobit/logit/probit here). Eliminating outliers can easily be done on your own; there's no automated way of doing it.
Use the logit2.ado from the link Dimitriy gave (Mitchell Petersen's website) and modify it to use the ologit command. It's simple enough to do with a little trial and error. Good luck!
If you have a variable with 21 ordinal categories, I would have no problem treating it as a continuous one. If you want to back that up somehow, I wrote a paper on welfare measurement with ordinal variables, see DOI:10.1111/j.1475-4991.2008.00309.x. Then you can use ivreg2. You should be aware of all the issues involved with that estimator, in particular that it implicitly assumes that the correlations are fully modeled by this two-way structure, and that observations for firms i and j and times t and s are definitely uncorrelated for i != j and t != s. Sometimes this is a strong assumption to make -- e.g., New York and New Jersey may be correlated in 2010, but New York 2010 is uncorrelated with New Jersey 2009.
I have no idea what you might mean by ordinal outliers. Somebody must have piled up a bunch of dissertation advice (or, worse, analysis requests) without really trying to make sense of every bit.