could you please give me some hints for identifying the nature of missingness for categorical variables' missing value? I mean, I gave a fast search on google scholar but I didn't find anything related with this. How could I understand if missing-values are missing completely at random, are they missing at random or finally, they are missing not at random? Except studying the domain I can't think anything. Links to some papers are appreciated, Thanks in advance.
(I'll add it in sas environment but the question is not specifically related with this language).
Since you've tagged this as SAS, one approach you could take would be to create a boolean variable for each of your categorical variables indicating whether or not it has a missing value in each row. Then you could do whatever analysis you like on the frequency of missing values, using the flags. E.g. you could use proc corr to see if missing values of one variable correlate with values of other variables.
E.g. suppose you have a situation like this:
data example;
set sashelp.class;
if AGE > 14 then call missing(SEX);
SEX_MISSING_FLAG = missing(SEX);
run;
Then you could spot it by running the following:
proc corr data = example outp= corr;
var age weight height sex_missing_flag;
run;
Output:
_TYPE_,_NAME_,Age,Weight,Height,SEX_MISSING_FLAG
MEAN,,13.32,100.03,62.34,0.26
STD,,1.49,22.77,5.13,0.45
N,,19.00,19.00,19.00,19.00
CORR,Age,1.00,0.74,0.81,0.78
CORR,Weight,0.74,1.00,0.88,0.64
CORR,Height,0.81,0.88,1.00,0.55
CORR,SEX_MISSING_FLAG,0.78,0.64,0.55,1.00
Related
I'm looking to examine unique combinations of 7 binary variables (cannabis modes of delivery [yes/no]) and was under the impression this should be a fairly simple task in SAS. However, all of the coding examples I've come across online seem a bit overcomplicated for such a basic process. If anyone has insight regarding this concept I would appreciate it!
So if you want the counts probably it easiest to use PROC SUMMARY.
proc summary data=have nway missing ;
class var1-var7 ;
output out=want(rename=(_freq_=count));
run;
proc sort data=yourdataset(Keep=Variables to examine seperated by blanks) out=choose_a_name noduprec;
by Variables to examine seperated by blanks;
run;
Resulting dataset choose_a_name will contain all unique combinations of examined variables.
I am handling a SAS dataset with little observations and I need to put on relation a variable with its lag.
By doing this, I lose a record resulting in a missing values.
Do some of you know how SAS Base handles such items in the procedures as PROC REG or PROC CORR?
Thanks you all in advance.
It depends a bit on what exactly you're doing. Based on your references to PROC REG and CORR - it excludes the missing values - listwise. This means if any value is missing in a row that is being used or referenced in the PROC it will be excluded.
http://documentation.sas.com/?docsetId=lrcon&docsetTarget=p175x77t7k6kggn1io94yedqagl3.htm&docsetVersion=9.4&locale=en
PROC CORR has several options that allow you to specify how it handles the missing values. The NOMISS option tells SAS to use only complete cases.
https://blogs.sas.com/content/iml/2012/01/13/missing-values-pairwise-corr.html
I'm going to ask this with an example...
Suppose i have a data set where each observation represents a person. Two of the variables are AGE and HASADOG (and say this has values 1 for yes and 2 for no.) Is there a way to run a PROC FREQ (by AGE*HASADOG) that forces SAS to include in the report a line for instances where the count is zero?
By this I mean: if there is a particular value for AGE such that no observation with this AGE value has a 1 in the HASADOG variable, the report will still include a row for this combination (with a row percent of 0.)
Is this possible?
The SPARSE option in PROC FREQ is likely all you need.
proc freq data=sashelp.class;
table sex*age / sparse list;
run;
If the value is nowhere in your data set at all, then there's no way for SAS to know it exists. In this case you'd need a more complex solution, basically a way to tell SAS all values you would be using ahead of time. This can be done via a PRELOADFMT or CLASSDATA option on several procs. There are asked an answered questions on this topic here on SO, so I won't provide a solution for this option, which seems beyond the scope of your question.
I have a SAS data set with missing data in multiple columns. I would like replace the missing data with a prediction based on the other data in the data set. Here a link that describes the method but doesn't show me how to do it. How do I replace the missing values with a prediction?
EDIT:
The method I had in mind was just using Proc Reg then apply the coefficents to the missing data to generate the estimate. Does this answer your question?
PROC STDIZE, PROC EXPAND, and PROC MI are all capable of performing different kinds of imputations on your data depending on exactly how you want do determine the 'prediction'.
For simple things like replacing with the mean, PROC STDIZE is the way to go. PROC MI is the most advanced - it performs multiple imputation. PROC EXPAND is appropriate if you have time-series data, as it will try to work out what the correct value is for that point in the time series.
If you have missing data in multiple columns you'll require multiple regressions. This probably isn't a good way to do this, but to answer the question - what you're requesting is called scoring a dataset and you can use PROC SCORE.
An alternative method is in your regression procedure request an OUTPUT data set that contains the predicted values for that regression.
output out=predicted1 p=pred_var_missing;
As a matter of methodology, I recommend #Joe's method instead.
Adding to #Joe 's answer, if you tell us why you want to do this imputation, we can provide better advice. I wrote a blog post called How to Ask a Statistics Question that may help.
However, often, single imputation is a bad method. More particularly, if you are going to do further analysis on this data (with the imputed values) then single imputation will underestimate the variability of the data and give wrong results.
PROC MI is usually a better approach.
proc means data=tableepisodes noprint;
output out=tableepisodes
mean(%ratings %dummies)=%ratings %dummies;
by ProgCodeID ProgSeasonCodeID year week
I was reading through a SAS code and I am not sure what the mean part of the code does ,
Is it that it only takes the mean of %ratings variables and attach the % dummies variables to the output ?
would really appreciate if I could get help in understanding this code snippet
That isn't a complete code snippet, and no.
It calculates the mean of the variables listed in %rating AND %dummies, assuming of course that's what is included in those macros.
Without seeing the macro definitions we can't be sure of what it is actually doing.
As written, the code is going to evaluate the means of the variables stored inside the macro variables ratings and dummies. Taking ratings as an example, we're assuming it was defined earlier on as something like:
%let ratings = good bad ugly;
So, when you pass it through the proc means, %ratings will evaluate to good bad ugly and SAS will take the means of all three variables.
You could have written the proc means function as:
proc means data = tableepisodes noprint;
by ProgCodeID ProgSeasonCodeID year week;
var good bad ugly;
output out = tableepisodes mean= / autoname;
run;
instead. (Also, note that you're overwriting your original dataset here, which you may want to avoid.)