How do I conditionally select variables in PROC SQL? - sas

I have calculated a frequency table in a previous step. Excerpt below:
I want to automatically drop all variables from this table where the frequency is missing. In the excerpt above, that would mean the variables "Exkl_UtgUtl_Taxi_kvot" and "Exkl_UtgUtl_Driv_kvot" would need to be dropped.
I try the following step in PROC SQL (which ideally I will repeat for all variables in the table):
PROC SQL;
CREATE TABLE test3 as
SELECT (CASE WHEN Exkl_UtgUtl_Flyg_kvot!=. THEN Exkl_UtgUtl_Flyg_kvot ELSE NULL END)
FROM stickprovsstorlekar;
quit;
This fails, however, since SAS does not like NULL values. How do I do this?
I tried just writing:
PROC SQL;
CREATE TABLE test3 as
SELECT (CASE WHEN Exkl_UtgUtl_Flyg_kvot!=. THEN Exkl_UtgUtl_Flyg_kvot)
FROM stickprovsstorlekar;
quit;
But that just generates a variable with an automatically generated name (like DATA_007). I want all variables containing missing values to be totally excluded from the results.

Let's say you have 10 variables, where var1, var3, var5, var7, and var9 have missing values in the first observation. We want to select only the variables with no missing observations.
var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
. 8 . 9 . 6 . 1 . 4
5 1 2 7 2 7 2 9 7 7
5 9 7 7 6 8 5 6 4 9
...
First, let's find all variables that have missing observations:
proc means data=have noprint;
var _NUMERIC_;
output out=missing nmiss=;
run;
Then transpose this output table so it's easier to work with:
proc transpose data=missing out=missing_tpose;
run;
We now have a table that looks like this:
_NAME_ COL1
_TYPE_ 0
_FREQ_ 10
var1 1
var2 0
var3 1
var4 0
var5 1
var6 0
var7 1
var8 0
var9 1
var10 0
When COL1 is > 0 and the name is not _TYPE_ or _FREQ_, that means the variable has missing values. Let's extract the name of the variable from _NAME_ into a comma-separated list.
proc sql noprint;
select _NAME_
into :vars separated by ','
from missing_tpose
where COL1 = 0 AND _NAME_ NOT IN('_TYPE_', '_FREQ_')
;
quit;
%put &vars and you'll see all of the non-missing values that can be passed into SQL.
var2,var4,var6,var8,var10
Now we have a dynamic way to select variables with only non-missing values.
proc sql;
create table want as
select &vars
from have
;
quit;

Related

How to create 2x2 table in sas for fisher exact test

I just performed the fisher test in R and in Excel on a 2x2 table with the contents 1 6 and 7 2. I can't manage to do this in sas.
data my_table;
input var1 var2 ##;
datalines;
1 6 7 2
;
proc freq data=my_table;
tables var1*var2 / fisher;
run;
The test somehow ignores that the table consists of the 4 variables but when I print the table it looks fine. In the test the contents of the table are 0, 1, 1, 0. I guess I need to change something when creating the data but what?
You do NOT have two variables that each have two categories.
Read the data in this way instead.
data have ;
do var1=1,2 ; do var2=1,2;
input count ##;
output;
end; end;
datalines;
1 6 7 2
;
Now VAR1 and VAR2 both have two possible values and COUNT has the number of cases for the particular combination. Use the WEIGHT statement to tell PROC FREQ to use COUNT as the number of cases.
proc freq data=have ;
weight count ;
tables var1*var2 / fisher ;
run;

How can I get the top row out of the following data in sas using proc sql

Dataset a:-
cc dob enrolled
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
SAS Code:-
proc sql;
select distinct count(*) as cust_enrolled ,year(enrolled) as yr
from a
group by yr
order by cust_enrolled desc;
quit;
Result:-
cust_enrolled yr
5 2010
1 2013
1 2004
1 1999
1 2001
1 2007
My query is to get the first row from this result. How can I achieve this?
Typically I would use a having clause testing an aggregate such as freq=max(freq). However, since freq is already an aggregate count(*) that has to be in a sub-select.
Example:
data have;
input cc dob: mmddyy10. enrolled: mmddyy10.;
format dob enrolled mmddyy10.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
proc sql;
create table most_popular_enrollment_year as
select * from
(select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
;
quit;
If there are multiple years with the max number of year enrollment count the query will return multiple rows. If you want the earliest year of those you need another nesting.
proc sql;
create table earliest_most_popular as
select * from
(
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
)
having yr_enroll=min(yr_enroll)
;
quit;
Another way is to sort by yr_enroll and use Proc SQL option OUTOBS=1 to grab the first
proc sql outobs=1;
create table earliest_most_popular as
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
order by yr_enroll
;
reset outobs=max;
You can use the OUTOBS option of PROC SQL to control how many observations the SELECT statement writes to the output destination(s).
First let's convert your listing into an actual dataset.
data have;
input cc dob :mmddyy. enrolled :mmddyy.;
format dob enrolled date9.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
Now let's run your SELECT statement with OUTOBS set to 1. Make sure to give it some criteria for deciding which observation to take when there are ties for the largest count.
proc sql outobs=1;
select year(enrolled) as yr
, count(*) as cust_enrolled
from have
group by yr
order by cust_enrolled desc, yr
;
quit;
Results:
cust_
yr enrolled
----------------------
2010 5
You can use data set options anywhere. SQL doesn't guarantee an order so you often will want logic that's more complicated than simply the first, but if that's what you want using the OBS=1 option is a decent option.
proc sql;
select * from sashelp.class(obs=1);
quit;
If you want something besides the first, use FIRSTOBS and OBS together.
proc sql;
select * from sashelp.class(firstobs=10 obs=10);
quit;

Replace Missing Values with Mean of column in SAS

I have a data set in SAS that has multiple columns that have missing data. This post replaces all the missing values in the entire data set with zeros. But since it goes through the entire data set you can't just replace the zero with the mean or median for that column. How do I replace missing data with the mean of that column?
There are only 5 or so columns so the script doesn't need to go through the entire data set.
PROC STDIZE has an option to do just this. The REPONLY option tells it you want it to only replace missing values, and METHOD=MEAN tells it how you want to replace those values. (PROC EXPAND also could be used, if you are using time series data, but if you're just using mean, STDIZE is the simpler one.)
For example:
data missing_class;
set sashelp.class;
if _N_=5 then call missing(age);
if _N_=7 then call missing(height);
if _N_=9 then call missing(weight);
run;
proc stdize data=missing_class out=imputed_class
method=mean reponly;
var age height weight;
run;
Ideally, you would want to use PROC MI to do multiple imputation and get a more accurate representation of missing values; however, if you wish to use the average, and alternate way of doing so can be done with PROC MEANS and a data step.
/* Set up data */
data have(index=(sex) );
set sashelp.class;
if(_N_ IN(3,7,9,12) ) then call missing(height);
run;
/* Calculate mean of all non-missing values */
proc means data=have noprint;
by sex;
output out=means mean(height) = imp_height;
run;
/* Merge avg. values with original data */
data want;
merge have
means;
by sex;
if(missing(height) ) then height = imp_height;
drop imp_height;
run;
You can use the mean function in proc sql to replace only the missing observations in each column:
data temp;
input var1 var2 var3 var4 var5;
datalines;
. 2 3 4 .
6 7 8 9 10
. 12 . . 15
16 17 18 19 .
21 . 23 24 25
;
run;
proc sql;
create table temp2 as select
case when missing(var1) then mean(var1) else var1 end as var1,
case when missing(var2) then mean(var2) else var2 end as var2,
case when missing(var3) then mean(var3) else var3 end as var3,
case when missing(var4) then mean(var4) else var4 end as var4,
case when missing(var5) then mean(var5) else var5 end as var5
from temp;
quit;
And, as Joe mentioned, you can use coalesce instead if you prefer that syntax:
coalesce(var1, mean(var1)) as var1

Make a SAS data column into a Macro variable?

How can I convert the output of a SAS data column into a macro variable?
For example:
Var1 | Var2
-----------
A | 1
B | 2
C | 3
D | 4
E | 5
What if I want a macro variable containing all of the values in Var1 to use in a PROC REG or other procedure? How can I extract that column into a variable which can be used in other PROCS?
In other words, I would want to generate the equivalent statement:
%LET Var1 =
A
B
C
D
E
;
But I will have different results coming from a previous procedure so I can't just do a '%LET'. I have been exploring SYMPUT and SYMGET, but they seem to apply only to single observations.
Thank you.
proc sql;
select var1
into :varlist separated by ' '
from have;
quit;
creates &varlist. macro variable, separated by the separation character. If you don't specify a separation character it creates a variable with the last row's value only.
There are a lot of other ways, but this is the simplest. CALL SYMPUTX for example will do the same thing, except it's complicated to get it to pull all rows into one.
You can use it in a proc directly, no need for a macro variable. I used numeric values for your var1 for simplicity, but you get the idea.
data test;
input var1 var2 ##;
datalines;
1 100 2 200 3 300 4 400 5 500
run;
proc reg data=TEST;
MODEL VAR1 = VAR2;
RUN;

How to rename variables without using their original names?

I have a data set that I am uploading to sas. There are always 4 variables in the exact same order. The problem is sometimes the variables could have slightly different names.
For example the first variable user . The next day i get the same dataset, it might be userid . . . So I cannot use rename(user=my_user)
Is there any way i could refer to the variable by their order . . something like this
rename(var_order_1=my_user) ;
rename(var_order_3=my_inc) ;
rename _ALL_=x1-x4 ;
There are a few ways to do this. One is to determine the variable names from PROC CONTENTS or dictionary.columns and generate rename statements.
data have;
input x1-x4;
datalines;
1 2 3 4
5 6 7 8
;;;;
run;
%macro rename(var=,newvar=);
rename &var.=&newvar.;
%mend rename;
data my_vars; *the list of your new variable names, and their variable number;
length varname $10;
input varnum varname $;
datalines;
1 FirstVar
2 SecondVar
3 ThirdVar
4 FourthVar
;;;;
run;
proc sql; *Create a list of macro calls to the rename macro from joining dictionary.columns with your data. ;
* Dictionary.columns is like proc contents.;
select cats('%rename(var=',name,',newvar=',varname,')')
into :renamelist separated by ' '
from dictionary.columns C, my_vars M
where C.memname='HAVE' and C.libname='WORK'
and C.varnum=M.varnum;
quit;
proc datasets;
modify have;
&renamelist; *use the calls;
quit;
Another is to put/input the data using the input stream and the _INFILE_ automatic variable (that references the current line in the input stream). Here's an example. You would of course keep only the new variables if you wanted.
data have;
input x1-x4;
datalines;
1 2 3 4
5 6 7 8
;;;;
run;
data want;
set have;
infile datalines truncover; *or it will go to next line and EOF prematurely;
input #1 ##; *Reinitialize to the start of the line or it will eventually EOF early;
_infile_=catx(' ',of _all_); *put to input stream as space delimited - if your data has spaces you need something else;
input y1-y4 ##; *input as space delimited;
put _all_; *just checking our work, for debugging;
datalines; *dummy datalines (could use a dummy filename as well);
;;;;
run;
Here is another approach using the dictionary tables..
data have;
format var1-var4 $1.;
call missing (of _all_);
run;
proc sql noprint;
select name into: namelist separated by ' ' /* create macro var */
from dictionary.columns
where libname='WORK' and memname='HAVE' /* uppercase */
order by varnum; /* should be ordered by this anyway */
%macro create_rename(invar=);
%do x=1 %to %sysfunc(countw(&namelist,%str( )));
/* OLDVAR = NEWVARx */
%scan(&namelist,&x) = NEWVAR&x
%end;
%mend;
data want ;
set have (rename=(%create_rename(invar=&namelist)));
put _all_;
run;
gives:
NEWVAR1= NEWVAR2= NEWVAR3= NEWVAR4=