proc sql perform same operation over multiple columns - sas

I have a dataset with 20 columns all starting with the name morb_, which are all 1 or 2, coded as No and Yes. There is an additional column called Pat_TNO which is the patient reference number. Patients have more than one row.
I wish to create a new dataset which summarises whether each patient has had at least one of each type of event. So far the code I have written works perfectly, but is there a way to simplify it using an array?
proc sql;
select
Pat_TNO,
max(morb_1) as morb_1 format yn.,
max(morb_2) as morb_2 format yn. /* etc etc */
from morbidity
group by Pat_TNO;
quit;
COumn names aren't morb_1 and morb_2, rather morb_amputation, morb_mi, morb_tia, etc.

proc summary data=morbidity nway missing;
class pat_tno;
output out=max max(morb_:) = ;
run;

Related

SAS targed encored

Hi,
Can someone explain to me what a given code sequence does step by step?**
I must describe it in detail what is happening in turn
%macro frequency_encoding(dataset, var);
proc sql noprint;
create table freq as
select distinct(&var) as values, count(&var) as number
from &dataset
group by Values ;
create table new as select *, round(freq.number/count(&var),00.01) As freq_encode
from &dataset left join freq on &var=freq.values;
quit;
data new(drop=values number &var);
set new;
rename freq_encode=&var;
run;
data new;
set new;
keep &var;
run;
data dane(drop = &var);
set dane;
run;
data dane;
set dane;
set new;
run;
The SQL is first finding the frequency of each value of the variable. Then it divides those counts by the total number of non-missing values and rounds that percentage to two decimal places (or integers when you think of the ratio as a percentage).
This could be done in one step with:
proc sql noprint;
create table new as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,&var as values,count(&var) as number
from &dataset
group by &var
)
;
quit;
It is not clear what the DANE dataset is supposed to be. If &DATESET does not equal DANE then those last four data steps make no sense. If it does then it is a convoluted way to replace the original variable with the percentage.
The first one is basically trying to rename the calculated percentage as the original variable and eliminate the original variable and the other two intermediate variables used in calculating the percentage.
The second one is dropping all of the variables except the new percentage.
The third one is dropping the original variable from "dane".
The last one is adding the new variable back to "dane".
Assuming DANE should be replaced with &DATASET then those four data steps could be reduced to one:
data &dataset;
set &dataset(drop=&var);
set new(keep=freq_encode rename=(freq_encode=&var));
run;
It is probably best not to overwrite your original dataset in that way. So perhaps you should add an OUT parameter to your macro to name to new dataset you want to create.
You could have avoided all of those data steps by just adding the DROP= and RENAME= dataset options to the dataset generated by the SQL query.
So perhaps you want something like this:
%macro frequency_encoding(dataset, var,out);
proc sql noprint;
create table &out(drop=&var number rename=(freq_encode=&var)) as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,count(&var) as number
from &dataset
group by &var
)
;
quit;
%mend ;
%frequency_encoding(sashelp.class,sex,work.class);

Iteratively adding to merged SAS dataset

I have 18 separate datasets that contain similar information: patient ID, number of 30-day equivalents, and total day supply of those 30-day equivalents. I've output these from a dataset that contains those 3 variables plus the medication class (VA_CLASS) and the quarter it was captured in (a total of 6 quarters).
Here's how I've created the 18 separate datasets from the snip of the dataset shown above:
%macro rx(class,num);
proc sql;
create table dm_sum&clas._qtr&num as select PatID,
sum(equiv_30) as equiv_30_&class._&num
from dm_qtrs
where va_class = "HS&class" and dm_qtr = &qtr
group by 1;
quit;
%mend;
%rx(500,1);
%rx(500,2);
%rx(500,3);
%rx(500,4);
%rx(500,5);
%rx(500,6);
%rx(501,1);
and so on...
I then need to merge all 18 datasets back together by PatID and what I'd like to do is iteratively add the next dataset created to the previous, as in, add dataset dm_sum_500_qtr3 to a file that already contains the results of dm_sum_500_qtr1 & dm_sum_500_qtr1.
Thanks for looking, Brian
In the macro append the created data set to it an accumulator data set. Be sure to delete it before starting so there is a fresh accumulation. If the process is run at different times (like weekly or monthly) you may want to incorporate a unique index to prevent repeated appendings. If you are stacking all these sums, the create table should also select va_class and dm_qtr
%macro (class, num, stack=perm.allClassNumSums);
proc sql; create table dm_sum&clas._qtr&num as … ;
proc append force base=perm.allClassNumSums data=dm_sum&clas._qtr#
run;
%mend;
proc sql;
drop table perm.allClassNumSums;
%rx(500,1)
%rx(500,2)
%rx(500,3)
%rx(500,4)
%rx(500,5)
…
A better approach might be a single query with an larger where, and leave the class and qtr as categorical variables. Your current approach is moving data (class and qtr) into metadata (column names). Such a transformation makes additional downstream processing more difficult.
Proc TABULATE or REPORT can be use a CLASS statement to assist the creation of output having category based columns. These procedures might even be able to work directly with the original data set and not require a preparatory SQL query.
proc sql;
create table want as
select
PatID, va_class, dm_qtr,
sum(equiv_30) as equiv_30_sum
from dm_qtrs
where catx(':', va_class, dm_sqt) in
(
'HS500:1'
'HS500:2'
'HS500:3'
…
'HS501:1'
)
group by PatID, va_class, dm_qtr;
quit;

SAS equivalent to R’s is.element()

It’s the first time that I’ve opened sas today and I’m looking at some code a colleague wrote.
So let’s say I have some data (import) where duplicates occur but I want only those which have a unique number named VTNR.
First she looks for unique numbers:
data M.import;
set M.import;
by VTNR;
if first.VTNR=1 then unique=1;
run;
Then she creates a table with the duplicated numbers:
data M.import_dup1;
set M.import;
where unique^=1;
run;
And finally a table with all duplicates.
But here she is really hardcoding the numbers, so for example:
data M.import_dup2;
set M.import;
where VTNR in (130001292951,130100975613,130107546425,130108026864,130131307133,130134696722,130136267001,130137413257,130137839451,130138291041);
run;
I’m sure there must be a better way.
Since I’m only familiar with R I would write something like:
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
I guess there must be something like the $ also for sas?
To me it looks like the most direct translation of the R code
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
Would be to use SQL code
proc sql;
create table import_dup2 as
select * from import
where VTNR in (select VTNR from import_dup1)
;
quit;
But if your intent is to find the observations in IMPORT that have more than one observation per VTNR value there is no need to first create some other table.
data import_dup2 ;
set import;
by VTNR ;
if not (first.VTNR and last.VTNR);
run;
I would use the options in PROC SORT.
Make sure to specify an OUT= dataset otherwise you'll overwrite your original data.
/*Generate fake data with dups*/
data class;
set sashelp.class sashelp.class(obs=5);
run;
/*Create unique and dup dataset*/
proc sort data=class nouniquekey uniqueout=uniquerecs out=dups;
by name;
run;
/*Display results - for demo*/
proc print data=uniquerecs;
title 'Unique Records';
run;
proc print data=dups;
title 'Duplicate Records';
run;
Above solution can give you duplicates but not unique values. There are many possible ways to do both in SAS. Very easy to understand would be a SQL solution.
proc sql;
create table no_duplicates as
select *
from import
group by VTNR
having count(*) = 1
;
create table all_duplicates as
select *
from import
group by VTNR
having count(*) > 1
;
quit;
I would use Reeza's or Tom's solution, but for completeness, the solution most similar to R (and your preexisting code) would be three steps. Again, I wouldn't use this here, it's excess work for something you can do more easily, but the concept is helpful in other situations.
First, get the dataset of duplicates - either her method, or proc sort.
proc sort nodupkey data=have out=nodups dupout=dups;
by byvar;
run;
Then pull those into a macro list:
proc sql;
select byvar
into :duplist separated by ','
from dups;
quit;
Then you have them in &duplist. and can use them like so:
data want;
set have;
if not (byvar in &duplist.);
run;
data want;
set import;
where VTNR in import_dup1;
run;

SAS sum variables using name after a proc transpose

I have a table with postings by category (a number) that I transposed. I got a table with each column name as _number for example _16, _881, _853 etc. (they aren't in order).
I need to do the sum of all of them in a proc sql, but I don't want to create the variable in a data step, and I don't want to write all of the columns names either . I tried this but doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going to the first number to the last and also from the number corresponding to the first place to the one corresponding to the last place. Gives me a number that it's not correct.
Any ideas?
Thanks!
You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work.
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names;
proc transpose data=class out=transposed;
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this but from my experience, I believe SAS will assume that any sum() statement in SQL is the sql-aggregate statement, unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the way arguments are passed into it. In the below example you can see that the internal sum() function has 3 arguments being passed in so SAS will treat this as the SAS sum() function (as the sql-aggregate statement only allows for a single argument). The result of the SAS function is then passed in as the single parameter to the sql-aggregate sum function:
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
sum_height_
Obs Sex and_weight
1 F 1356.3
2 M 1728.6
Also note a trick I've used in the code by passing in 0 to the SAS function - this is an easy way to add an additional parameter without changing the intended result. Depending on your data, you may want to swap out the 0 for a null value (ie. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario - ie. matching fields that begin with an underscore, instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.

Can this multi-step process be reduced to one proc sql statement?

I've been trying to make my code more efficient and this is the original code, but I think it can be written in one step.
data TABLE;set ORIGINAL_DATA;
Multi=percent*total_units;
keep Multi Type;
proc sort; by Type;
proc means noprint data=TABLE1; by Type; var Multi;output out=Table2(drop= _type_ _freq_)sum=Multi;run;
proc means noprint data=Table1; var Multi;output out=Table3(drop= _type_ _freq_) sum=total ;run;
proc sql;
create table TABLE4as
select a.Type, a.Multi label="Multi", b.total label="total"
from TABLE2 a, TABLE3 b
order by Type;
quit;
data TABLE5;set TABLE4;
pct=(MULTI/total)*100;
run;
I am able to split up part of it, but I can't figure out how to get the PCT part in my code. This is what I have.
proc sql;
create table TABLE1 as
select distinct type, sum(percent*total_units) as MULTI label "MULTI",
MULTI/(percent*total_units)) as PCT
from ORIGINAL_DATA
group by type;
quit;
I had to edit some of the code but I think the general idea should make sense.
The main problem is I cannot call upon the MULTI column because it is just being created but I want to create a percentage of the total for each type.
The "SAS" way to do something like this is to use a CLASS statement with PROC MEANS. That will calculate statistics on all the interaction levels in the data (identified by the TYPE variable). The row where TYPE=0 will be the "total" value, representing the value of that statistic for the entire data set.
In your case, we can take advantage of the fact that PROC MEANS will create the output data set sorted by TYPE and by the variables listed in the CLASS statement. That means we can just read the first observation and save it's value for calculating percentages.
It's probably easier to just show some code:
data TABLE;
set ORIGINAL_DATA;
Multi = percent * total_units;
keep Multi Type;
run;
proc means noprint data=TABLE;
class Type;
var multi;
output out=next sum=;
run;
data want;
retain total;
set next;
if _n_ = 1 then do;
/* The first obs will be the _TYPE_=0 record */
total = multi;
delete;
end;
pct = (multi / total) * 100;
drop total _freq_ _type_;
run;
Notice that you do not need to sort the data before using PROC MEANS. That's because we are using a CLASS statement rather than a BY statement. The data step is using the first observation in the data set created by MEANS (the TYPE=0 record) to retain the total sum of your variable. The delete statement keeps it out of the result.
CLASS statements with PROC MEANS are very useful. Take a few minutes to read up on how the TYPE variable is calculated, especially if you try using more than one class variable.
You can skip the initial data step by using the WEIGHT option in VAR statement of PROC MEANS (this will effectively do the multiplication for you). You can also use PROC TABULATE instead of PROC MEANS, as tabulate can calculate the percentage. I believe the following code will produce your required output in one go.
ods noresults;
proc tabulate data=have out=want (drop=_: rename=(total_units_sum=total total_units_pctsum_0=pct));
class type;
var total_units / weight=percent;
table type, total_units*(sum pctsum);
run;
ods results;
If you need one step, maybe this will work, but it's not actually efficient, since it processes data twice, once for detail by TYPE, once for total.
proc sql;
create table TABLE1 as
select
d.type
, sum(d.percent*d.total_units) as MULTI label "MULTI"
, calculated MULTI/s.total as PCT
from ORIGINAL_DATA d,
( select sum(percent*total_units) as total
from ORIGINAL_DATA) s
group by type
;
quit;
For more efficiency, but in more than one steps you could simply replace tables withe views in your original code:
data TABLE; => data TABLE / view=TABLE;
create table TABLE4 => create view TABLE4