I have a data set in SAS that has multiple columns that have missing data. This post replaces all the missing values in the entire data set with zeros. But since it goes through the entire data set you can't just replace the zero with the mean or median for that column. How do I replace missing data with the mean of that column?
There are only 5 or so columns so the script doesn't need to go through the entire data set.
PROC STDIZE has an option to do just this. The REPONLY option tells it you want it to only replace missing values, and METHOD=MEAN tells it how you want to replace those values. (PROC EXPAND also could be used, if you are using time series data, but if you're just using mean, STDIZE is the simpler one.)
For example:
data missing_class;
set sashelp.class;
if _N_=5 then call missing(age);
if _N_=7 then call missing(height);
if _N_=9 then call missing(weight);
run;
proc stdize data=missing_class out=imputed_class
method=mean reponly;
var age height weight;
run;
Ideally, you would want to use PROC MI to do multiple imputation and get a more accurate representation of missing values; however, if you wish to use the average, and alternate way of doing so can be done with PROC MEANS and a data step.
/* Set up data */
data have(index=(sex) );
set sashelp.class;
if(_N_ IN(3,7,9,12) ) then call missing(height);
run;
/* Calculate mean of all non-missing values */
proc means data=have noprint;
by sex;
output out=means mean(height) = imp_height;
run;
/* Merge avg. values with original data */
data want;
merge have
means;
by sex;
if(missing(height) ) then height = imp_height;
drop imp_height;
run;
You can use the mean function in proc sql to replace only the missing observations in each column:
data temp;
input var1 var2 var3 var4 var5;
datalines;
. 2 3 4 .
6 7 8 9 10
. 12 . . 15
16 17 18 19 .
21 . 23 24 25
;
run;
proc sql;
create table temp2 as select
case when missing(var1) then mean(var1) else var1 end as var1,
case when missing(var2) then mean(var2) else var2 end as var2,
case when missing(var3) then mean(var3) else var3 end as var3,
case when missing(var4) then mean(var4) else var4 end as var4,
case when missing(var5) then mean(var5) else var5 end as var5
from temp;
quit;
And, as Joe mentioned, you can use coalesce instead if you prefer that syntax:
coalesce(var1, mean(var1)) as var1
Related
I just performed the fisher test in R and in Excel on a 2x2 table with the contents 1 6 and 7 2. I can't manage to do this in sas.
data my_table;
input var1 var2 ##;
datalines;
1 6 7 2
;
proc freq data=my_table;
tables var1*var2 / fisher;
run;
The test somehow ignores that the table consists of the 4 variables but when I print the table it looks fine. In the test the contents of the table are 0, 1, 1, 0. I guess I need to change something when creating the data but what?
You do NOT have two variables that each have two categories.
Read the data in this way instead.
data have ;
do var1=1,2 ; do var2=1,2;
input count ##;
output;
end; end;
datalines;
1 6 7 2
;
Now VAR1 and VAR2 both have two possible values and COUNT has the number of cases for the particular combination. Use the WEIGHT statement to tell PROC FREQ to use COUNT as the number of cases.
proc freq data=have ;
weight count ;
tables var1*var2 / fisher ;
run;
I am working in SAS with a dataset with a lot of numeric values which I have standardised as follows:
proc standard data=df mean=0 std=1
out=df;
run;
Is there any easy way to deal with outliers (+/- 3standard deviation) for all numeric values? Ideally I would want to change all of those to + or - 3x standard deviation, or in worst case remove them.
You have to run through the data twice. There are may ways you can adjust your output. Here's a simple way using a datastep:
Assuming your dataset has a standardized variable called 'test':
Data adjusted;
set df;
if test > 3 then test=3;
if test < -3 then test =-3;
run;
just remember your new dataset will no longer have a mean of 0 and a standard deviation of 1
No sample data provided so I generated 5 random variables with N(0,2) distribution for the purpose of demonstration for removing outliers from N(0,1).
If you have multiple columns to remove outliers from, you could create a macro or just loop through an array.
DATA have;
INPUT var1 var2 var3 var4 var5;
DATALINES;
-0.8458048655231136 -2.1737985573160485 -2.122482432573275 1.8746296707673902 -2.799009287469253
-1.9927731684115295 1.8230096873238637 0.5964656531490122 -1.6465532407305106 3.9430012045284184
0.0294083016125659 1.3877418982525658 -1.3398372120124733 -0.8195179339297752 4.742490300459201
-0.5215716306745832 -3.35412129416837 1.1558155344985737 -1.0073681302151822 2.425914724408619
-2.817574234024364 3.5161858163738424 -2.1822454739704744 0.060674570200235534 0.25898913069677443
-3.941905381717187 4.969013776451821 2.021891632999466 -1.1526212617289868 1.2864391876960568
;
run;
* variable of all columns to remove outliers from ;
%LET column_names=var1 var2 var3 var4 var5;
DATA want;
SET have;
ARRAY columns {*} &column_names.;
DO i=1 to dim(columns);
if columns[i]>3 then columns[i]=3;
if columns[i]<-3 then columns[i]=-3;
END;
DROP i;
RUN;
What i want to do: I need to create a new variables for each value labels of a variable and do some recoding. I have all the value labels output from a SPSS file (see sample).
Sample:
proc format; library = library ;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
value ... (many more with different amount of levels)
The new variable name would be the actual one without F and with underscore+level (example: FUMERT1F level 0 would become FUMERT1_0).
After that i need to recode the variables on this pattern:
data ds; set ds;
FUMERT1_0=0;
if FUMERT1=0 then FUMERT1_0=1;
FUMERT1_1=0;
if FUMERT1=1 then FUMERT1_1=1;
FUMERT1_2=0;
if FUMERT1=2 then FUMERT1_2=1;
FUMERT1_3=0;
if FUMERT1=3 then FUMERT1_3=1;
run;
Any help will be appreciated :)
EDIT: Both answers from Joe and the one of data_null_ are working but stackoverflow won't let me pin more than one right answer.
Update to add an _ underscore to the end of each name. It looks like there is not option for PROC TRANSREG to put an underscore between the variable name and the value of the class variable so we can just do a temporary rename. Create rename name=newname pairs to rename class variable to end in underscore and to rename them back. CAT functions and SQL into macro variables.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
%let class=sex fumert1;
proc transpose data=have(obs=0) out=vnames;
var &class;
run;
proc print;
run;
proc sql noprint;
select catx('=',_name_,cats(_name_,'_')), catx('=',cats(_name_,'_'),_name_), cats(_name_,'_')
into :rename1 separated by ' ', :rename2 separated by ' ', :class2 separated by ' '
from vnames;
quit;
%put NOTE: &=rename1;
%put NOTE: &=rename2;
%put NOTE: &=class2;
proc transreg data=have(rename=(&rename1));
model class(&class2 / zero=none);
id caseid;
output out=design(drop=_: inter: rename=(&rename2)) design;
run;
%put NOTE: _TRGIND(&_trgindn)=&_trgind;
First try:
Looking at the code you supplied and the output from Joe's I don't really understand the need for the formats. It looks to me like you just want to create dummies for a list of class variables. That can be done with TRANSREG.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
proc transreg data=have;
model class(sex fumert1 / zero=none);
id caseid;
output out=design(drop=_: inter:) design;
run;
proc contents;
run;
proc print data=design(obs=40);
run;
One good alternative to your code is to use proc transpose. It won't get you 0's in the non-1 cells, but those are easy enough to get. It does have the disadvantage that it makes it harder to get your variables in a particular order.
Basically, transpose once to vertical, then transpose back using the old variable name concatenated to the variable value as the new variable name. Hat tip to Data null for showing this feature in a recent SAS-L post. If your version of SAS doesn't support concatenation in PROC TRANSPOSE, do it in the data step beforehand.
I show using PROC EXPAND to then set the missings to 0, but you can do this in a data step as well if you don't have ETS or if PROC EXPAND is too slow. There are other ways to do this - including setting up the dataset with 0s pre-proc-transpose - and if you have a complicated scenario where that would be needed, this might make a good separate question.
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
proc transpose data=have out=want_pre;
by caseID;
var fumert1 sex;
copy fumert1 sex;
run;
data want_pre_t;
set want_pre;
x=1; *dummy variable;
run;
proc transpose data=want_pre_t out=want delim=_;
by caseID;
var x;
id _name_ col1;
copy fumert1 sex;
run;
proc expand data=want out=want_e method=none;
convert _numeric_ /transformin=(setmiss 0);
run;
For this method, you need to use two concepts: the cntlout dataset from proc format, and code generation. This method will likely be faster than the other option I presented (as it passes through the data only once), but it does rely on the variable name <-> format relationship being straightforward. If it's not, a slightly more complex variation will be required; you should post to that effect, and this can be modified.
First, the cntlout option in proc format makes a dataset of the contents of the format catalog. This is not the only way to do this, but it's a very easy one. Specify the appropriate libname as you would when you create a format, but instead of making one, it will dump the dataset out, and you can use it for other purposes.
Second, we create a macro that performs your action one time (creating a variable with the name_value name and then assigning it to the appropriate value) and then use proc sql to make a bunch of calls to that macro, once for each row in your cntlout dataset. Note - you may need a where clause here, or some other modifications, if your format library includes formats for variables that aren't in your dataset - or if it doesn't have the nice neat relationship your example does. Then we just make those calls in a data step.
*Set up formats and dataset;
proc format;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
quit;
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
*Dump formats into table;
proc format cntlout=formats;
quit;
*Macro that does the above assignment once;
%macro spread_var(var=, val=);
&var._&val.= (&var.=&val.); *result of boolean expression is 1 or 0 (T=1 F=0);
%mend spread_var;
*make the list. May want NOPRINT option here as it will make a lot of calls in your output window otherwise, but I like to see them as output.;
proc sql;
select cats('%spread_var(var=',substr(fmtname,1,length(Fmtname)-1),',val=',start,')')
into :spreadlist separated by ' '
from formats;
quit;
*Actually use the macro call list generated above;
data want;
set have;
&spreadlist.;
run;
I have a data set that I am uploading to sas. There are always 4 variables in the exact same order. The problem is sometimes the variables could have slightly different names.
For example the first variable user . The next day i get the same dataset, it might be userid . . . So I cannot use rename(user=my_user)
Is there any way i could refer to the variable by their order . . something like this
rename(var_order_1=my_user) ;
rename(var_order_3=my_inc) ;
rename _ALL_=x1-x4 ;
There are a few ways to do this. One is to determine the variable names from PROC CONTENTS or dictionary.columns and generate rename statements.
data have;
input x1-x4;
datalines;
1 2 3 4
5 6 7 8
;;;;
run;
%macro rename(var=,newvar=);
rename &var.=&newvar.;
%mend rename;
data my_vars; *the list of your new variable names, and their variable number;
length varname $10;
input varnum varname $;
datalines;
1 FirstVar
2 SecondVar
3 ThirdVar
4 FourthVar
;;;;
run;
proc sql; *Create a list of macro calls to the rename macro from joining dictionary.columns with your data. ;
* Dictionary.columns is like proc contents.;
select cats('%rename(var=',name,',newvar=',varname,')')
into :renamelist separated by ' '
from dictionary.columns C, my_vars M
where C.memname='HAVE' and C.libname='WORK'
and C.varnum=M.varnum;
quit;
proc datasets;
modify have;
&renamelist; *use the calls;
quit;
Another is to put/input the data using the input stream and the _INFILE_ automatic variable (that references the current line in the input stream). Here's an example. You would of course keep only the new variables if you wanted.
data have;
input x1-x4;
datalines;
1 2 3 4
5 6 7 8
;;;;
run;
data want;
set have;
infile datalines truncover; *or it will go to next line and EOF prematurely;
input #1 ##; *Reinitialize to the start of the line or it will eventually EOF early;
_infile_=catx(' ',of _all_); *put to input stream as space delimited - if your data has spaces you need something else;
input y1-y4 ##; *input as space delimited;
put _all_; *just checking our work, for debugging;
datalines; *dummy datalines (could use a dummy filename as well);
;;;;
run;
Here is another approach using the dictionary tables..
data have;
format var1-var4 $1.;
call missing (of _all_);
run;
proc sql noprint;
select name into: namelist separated by ' ' /* create macro var */
from dictionary.columns
where libname='WORK' and memname='HAVE' /* uppercase */
order by varnum; /* should be ordered by this anyway */
%macro create_rename(invar=);
%do x=1 %to %sysfunc(countw(&namelist,%str( )));
/* OLDVAR = NEWVARx */
%scan(&namelist,&x) = NEWVAR&x
%end;
%mend;
data want ;
set have (rename=(%create_rename(invar=&namelist)));
put _all_;
run;
gives:
NEWVAR1= NEWVAR2= NEWVAR3= NEWVAR4=
How can you create a SAS data set from another dataset using only the last n observations from original dataset. This is easy when you know the value of n. If I don't know 'n' how can this be done?
This assumes you have a macro variable that says how many observations you want. NOBS tells you the number of observations in the dataset currently without reading the whole thing.
%let obswant=5;
data want;
set sashelp.class nobs=obscount;
if _n_ gt (obscount-&obswant.);
run;
Using Joe's example of a macro variable to specify the number of observations you want, here is another answer:
%let obswant = 10;
data want;
do _i_=nobs-(&obswant-1) to nobs;
set have point=_i_ nobs=nobs;
output;
end;
stop; /* Needed to stop data step */
run;
This should perform better since it only reads the specific observations you want.
If the dataset is large, you might not want to read the whole dataset. Instead you could try a construction that reads the total number of Observations in the dataset first. So if you want to have the last of observations:
data t;
input x;
datalines;
1
2
3
4
;
%let dsid=%sysfunc(open(t));
%let num=%sysfunc(attrn(&dsid,nlobs));
%let rc=%sysfunc(close(&dsid));
%let number = 2;
data tt;
set t (firstobs = %eval(&num.-&number.+1));
run;
For the sake of variety, here's another approach (not necessarily a better one)
%let obswant=5;
proc sql noprint;
select nlobs-&obswant.+1 into :obscalc
from dictionary.tables
where libname='SASHELP' and upcase(memname)='CLASS';
quit;
data want;
set sashelp.class (firstobs=&obscalc.);
run;
You can achive this using the
_nobs_ and _n_ variables. First, create a temporary variable to store the total no of obs. Then compare the automatic variable N to nobs.
data a;
set sashelp.class nobs=_nobs_;
if _N_ gt _nobs_ -5;
run;