SAS. calculate % from DISTINCT COUNT - sas

I am working in SAS Studio Version: 2022.09.
I am working with survey data and will be tracking Region-Facility that has not submitted a survey in over 3 weeks. Surveys are voluntary but ideally facilities will submit a new survey weekly.
Region  Facility (Type&Name)       Date Survey Submitted
North   Hospital-Baptist Hospital  1/01/2023
South   PCP-Family Care            1/01/2023
North   PCP- Primary Medical       1/08/2023
South   PCP-Family Care            1/08/2023
North   Hospital-Baptist Hospital  1/15/2023
North   Hospital-St Mary Hospital  1/15/2023
West    Daycare-Early Learning     1/15/2023
West    Hospital-Methodist         1/15/2023
South   Daycare-Early Learning     1/15/2023
I use the code below to obtain a list of facilities, by region, that have submitted before but have not submitted in the last 3 weeks. Since we do not expect to be successful with every facility, we will stop following facilities after 10 weeks.
Data have;
set want;
DaysDiff=intck('day', Date, today());
run;
proc sort data=have;
by Facility Region Date;
run;
data have;
set have;
by Facility;
if last.Facility;
run;
proc sort data=have
out=SurveysMissing;
BY Region Facility;
WHERE DaysDiff>21 AND DaysDiff<70;
run;
To assist in determining significance of losing facilities that had not submitted recently, I would like to obtain a %.
[Total # of facilities per REGION that have not submitted survey >21 <70] / [Total # of facilities per REGION that have reported in the last 10 weeks]
/* # of facilities not submitted >21 AND <70 */
proc sql;
SELECT Count(Distinct Facility) AS Count, Region
FROM have
WHERE DaysDiff>21 AND DaysDiff <70
GROUP BY Region;
quit;
/*Count of Distinct Facilities per Region*/
proc sql;
SELECT Count(Distinct Facility) AS Count, Region
FROM have
WHERE DaysDiff <70
GROUP BY Region;
quit;
Would I need to create tables and do a left join to calculate %?
Thanks.

In Proc SQL a true condition resolves to 1 and false to 0. You can leverage this feature to compute the ratio of sums of expressions or binary flags.
Example:
Compute the ratio based on a subquery that flags facilities
proc sql;
create table want as
select
region
, sum(isquiet_flag) / sum(submitted_flag) as quiet_fraction label = 'Fraction of quiet facilities'
from
( /* one row per facility: days since its most recent submission */
select region, facility
, min(today() - date_submitted) > 21 as isquiet_flag
, min(today() - date_submitted) < 70 as submitted_flag
from have
where today() - date_submitted < 70
group by region, facility
)
group by
region
;
quit;

In your last data step for have, add an indicator for a missing survey.
data have;
set have;
by Facility;
if last.Facility;
surveymissing = (daysdiff > 21); * contains 1 if condition is true, otherwise 0;
run;
Then use proc summary to compute your numerator and denominator for each region. The numerator is the sum of surveymissing while the denominator is the count of the same.
proc summary data=have nway;
where daysdiff < 70;
class region;
var surveymissing;
output out=region_summary (drop=_:) sum=SurveysMissing n=TotalFacilities;
run;
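The region_summary table above holds the numerator and denominator but not the percentage itself. A minimal sketch of the last step, assuming region_summary was created as shown; the names region_pct and PctMissing are made up for illustration:

```sas
/* Sketch: turn the per-region counts into a percentage.        */
/* region_pct and PctMissing are hypothetical names.            */
data region_pct;
    set region_summary;
    /* guard against an empty region to avoid division by zero */
    if TotalFacilities > 0 then
        PctMissing = 100 * SurveysMissing / TotalFacilities;
    format PctMissing 5.1;
run;

proc print data=region_pct noobs;
run;
```

Because surveymissing is a 0/1 flag, requesting mean= in the PROC SUMMARY output statement would give the same fraction directly.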

SAS How to sum a variable in duplicate records

Noob SAS user here.
I have a hospital data set with patientID and a variable that counts the days between admission and discharge.
Those patients who had more than one hospital admission show up with the same patientID and with a record of how many days they were in hospital each time.
I want to sum the total days in hospital per patient, and then only have one patientID record with the sum of all hospital days across all stays. Does anyone know how I would go about this?
You want to select the sum of days_in_hospital and group by patientID. This will get what you want:
proc sql;
create table want as
select distinct
patientID,
sum(days_in_hospital) as sum_of_days
from have
group by patientID;
quit;
Alternatively you can use proc summary.
proc summary data= hospital_data nway;
class patientID;
var days;
output out=summarized_data (drop = _type_ _freq_) sum=;
run;
This creates a new dataset called summarized_data which has the summed days for each patientID. (The nway option removes the overall summary row, and the drop= data set option removes the default _type_ and _freq_ columns you don't need.)

Missing values in VARMAX

I have a dataset with visitors and weather variables. I'm trying to forecast visitors based on the weather variables. Since the dataset only contains visitors in season, there are missing values and gaps in every year. PROC REG runs fine in SAS, but PROC VARMAX does not: I cannot run the regression because of the missing values. How can I tackle this?
proc varmax data=tivoli4 printall plots=forecast(all);
id obs interval=day;
model lvisitors = rain sunshine averagetemp
dfebruary dmarch dmay djune djuly daugust doctober dnovember ddecember
dwednesday dthursday dfriday dsaturday dsunday
d_24Dec2016 d_05Dec2013 d_24Dec2017 d_24Dec2014 d_24Dec2015 d_24Dec2019
d_24Dec2018 d_24Sep2012 d_06Jul2015
d_08feb2019 d_16oct2014 d_15oct2019 d_20oct2016 d_15oct2015 d_22sep2017 d_08jul2015
d_20Sep2019 d_08jul2016 d_16oct2013 d_01aug2012 d_18oct2012 d_23dec2012 d_30nov2013 d_20sep2014 d_17oct2012 d_17jun2014
dFrock2012 dFrock2013 dFrock2014 dFrock2015 dFrock2016 dFrock2017 dFrock2018 dFrock2019
dYear2015 dYear2016 dYear2017
/p=7 q=2 Method=ml dftest;
garch p=1 q=1 form=ccc OUTHT=CONDITIONAL;
restrict
ar(3,1,1)=0, ar(4,1,1)=0, ar(5,1,1)=0,
XL(0,1,13)=0, XL(0,1,14)=0, XL(0,1,13)=0, XL(0,1,27)=0, XL(0,1,38)=0, XL(0,1,42)=0;
output lead=10 out=forecast;
run;
As with any forecast, you will first need to prepare your time series. Run your data through PROC TIMESERIES to fill in or impute missing values. The most appropriate imputation choice depends on your variables. The code below will:
Sum lvisitors by day and set missing values to 0
Set missing values of averagetemp to average
Set missing values of rain, sunshine, and your variables starting with d to 0 (assuming these are indicators)
Code:
proc timeseries data=have out=want;
id obs interval = day
setmissing = 0
notsorted
;
var lvisitors / accumulate=total;
crossvar averagetemp / accumulate=none setmissing=average;
crossvar rain sunshine d: / accumulate=none;
run;
Important Time Interval Consideration
Depending on your data, this could bias your error rate and estimates since you always know no one will be around in the off-season. If you have many missing values for off-season data, you will want to remove those rows.
Since PROC VARMAX does not support custom time intervals, you can instead create a simple time identifier. Alternatively, you can turn this identifier into a format with PROC FORMAT and convert time_id back to a date at the end.
data want;
set have;
time_id+1;
run;
proc varmax data=want;
id time_id interval=day;
...
output lead=10 out=myforecast;
run;
data myforecast;
merge myforecast
want(keep=time_id date)
;
by time_id;
run;
Or, if you made a format:
data myforecast;
set myforecast;
date = put(time_id, timeid.);
drop time_id;
run;
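The timeid format used above is not defined anywhere in the answer. One way it might be built is from the want dataset through a CNTLIN control dataset. This is only a sketch: it assumes want holds one row per time_id with a numeric SAS date variable named date, and the dataset name timeid_fmt is made up.

```sas
/* Sketch: build a numeric format that maps time_id to a date string. */
/* Assumes WANT has TIME_ID and a numeric SAS date variable DATE.      */
data timeid_fmt;
    set want(keep=time_id date);
    retain fmtname 'timeid' type 'N';   /* numeric format named TIMEID. */
    start = time_id;                    /* value to look up             */
    label = put(date, date9.);          /* text the format returns      */
    keep fmtname type start label;
run;

proc format cntlin=timeid_fmt;
run;
```

Since PUT returns character text, wrapping the lookup as input(put(time_id, timeid.), date9.) would give a numeric SAS date instead of a string.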

SAS Macro help to loop monthly sas datasets

I have monthly datasets in a SAS library for customers from Jan 2013 onwards, named CUST_JAN2013, CUST_FEB2013, ..., CUST_OCT2017. Each monthly dataset is huge, with records for about 2 million members, and has two columns (customer number and customer monthly expenses).
I have one input dataset Cust_Expense with customer number and month as columns. This Cust_Expense table has only 250,000 members and want to pull expense data for each member from SPECIFIC monthly SAS dataset by joining customer number.
Cust_Expense
------------
Customer_Number Month
111 FEB2014
987 APR2017
784 FEB2014
768 APR2017
.....
145 AUG2017
345 AUG2014
I have tried using call execute, but it loops through each of the 250,000 records of the input dataset (Cust_Expense) and joins each one with the corresponding monthly SAS customer table, which takes too much time.
Is there a way to read the input table (Cust_Expense) by month, so that we read all customers for a specific month and then read the corresponding monthly table ONCE to pull all the records for that month, instead of looping 250,000 times?
Depending on what you want the result to be, you can create one output per month by filtering cust_expense by month and joining with the corresponding monthly dataset:
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;
proc sql;
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
create table want_&month. as
select *
from cust_expense(where=(month="&month.")) t1
inner join cust_&month. t2
on t1.customer_number=t2.customer_number
;
%end;
quit;
%mend;
%want;
Or you could have one output using one join by 'unioning' all those monthly datasets into one and dynamically adding a month column.
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;
proc sql;
create table want as
select *
from cust_expense t1
inner join (
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
%if &i>1 %then union;
select *, "&month." as month
from cust_&month
%end;
) t2
on t1.customer_number=t2.customer_number
and t1.month=t2.month
;
quit;
%mend;
%want;
In either case, I don't really see the point in joining those monthly datasets with the cust_expense dataset. The latter does not seem to hold any information that isn't already present in the monthly datasets.
Your first, best answer is to get rid of these monthly separate tables and make them into one large table with ID and month as key. Then you can simply join on this and go on your way. Having many separate tables like this where a data element determines what table they're in is never a good idea. Then index on month to make it faster.
If you can't do that, then try creating a view that is all of those tables unioned. It may be faster to do that; SAS might decide to materialize the view but maybe not (but if it's extremely slow, then look in your temp table space to see if that's what's happening).
Third option then is probably to make use of SAS formats. Turn the smaller table into a format, using the CNTLIN option. Then a single large datastep will allow you to perform the join.
data want;
set jan feb mar apr ... ;
where put(id,CUSTEXPF1.) = '1';
run;
That only makes one pass through the 250k table and one pass through the monthly tables, plus the very very fast format lookup which is undoubtedly zero cost in this data step (as the disk i/o will be slower).
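The format-based join above leaves out how the format is created. A rough sketch of one way to do it follows, under several assumptions not stated in the answer: the lookup key combines customer number and month, the monthly tables use the column name customer_number, and the names cntlin, custexp, source and want are made up. Only two monthly tables are listed; the rest would be added to the SET statement.

```sas
/* Sketch: build a character format from the small table via CNTLIN. */
data cntlin;
    length fmtname $8 type $1 start $40 label $1 hlo $1;
    set cust_expense end=last;
    fmtname = 'custexp';
    type    = 'C';
    start   = catx('|', customer_number, month);  /* e.g. 111|FEB2014 */
    label   = '1';
    hlo     = ' ';
    output;
    if last then do;      /* OTHER range: keys not in cust_expense */
        hlo = 'O'; start = ' '; label = '0';
        output;
    end;
    keep fmtname type start label hlo;
run;

/* duplicate keys would make PROC FORMAT fail, so remove them first */
proc sort data=cntlin nodupkey;
    by hlo start;
run;

proc format cntlin=cntlin;
run;

/* One pass over the monthly tables; the month comes from the table name. */
data want;
    length source $41 month $7;
    set cust_jan2013 cust_feb2013 indsname=source;
    month = scan(source, -1, '_');      /* WORK.CUST_FEB2013 -> FEB2013 */
    if put(catx('|', customer_number, month), $custexp.) = '1';
run;
```

A subsetting IF is used rather than WHERE because the INDSNAME= variable is only available once a row has been read.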
You could split your input data into one dataset per month, as in this example:
data test;
infile datalines dsd;
input ID : $2. MONTH $3. ;
datalines;
1,JAN
2,JAN
3,JAN
4,FEB
5,FEB
6,MAR
7,MAR
8,MAR
9,MAR
;
run;
data JAN FEB MAR;
set test;
if MONTH = "JAN" then output JAN;
if MONTH = "FEB" then output FEB;
if MONTH = "MAR" then output MAR;
run;
This avoids looping through all 250,000 IDs and relies on the data step's OUTPUT statement.
At the end you will get 12 datasets, each containing the related IDs.
In your case, where the values look like FEB2014, you can use the SUBSTR function, and the condition in your data step becomes:
...
set test;
...
if SUBSTR(MONTH,1,3)="FEB" then output FEB;
...
Regards

How can I assign values to dataset based on time and overlapping numerical ranges? - SAS

I have a credit card transaction dataset (let's call it "Trans") with transaction amount, zip code, and date. I have another dataset (let's call it "Key") that lists sales tax rates based on date and geocode. The Key dataset also includes a range of zip codes associated with each geocode represented by 2 variables: Zip Start and Zip End.
Because Geocodes don't align with zip codes, some of the zip code ranges overlap. If this happens, I want to use the lowest sales tax rate associated with the zip code shown in Trans.
Trans dataset:
TransAmount TransDate TransZip
$200 01/07/1998 90010
$12 02/09/2002 90022
Key dataset:
Geocode Rate StartDate EndDate ZipStart ZipEnd
1001 .0825 199701 200012 90001 90084
1001 .085 200101 200812 90001 90084
1002 .0825 199701 200012 90022 90024
1002 .08 200101 200812 90022 90024
Desired output:
TransAmount TransDate TransZip Rate
$200 01/07/1998 90010 .0825
$12 02/09/2002 90022 .08
I used this basic SQL code in SAS, but I run into the problem of overlapping zip codes.
proc sql;
create table output as
select a.*, b.zipstart, b.zipend, b.startdate, b.enddate, b.rate
from Trans.CA_Zip_Cd_Testing a left join Key.CA_rates b
on a.TransZip ge b.zipstart
and a.TransZip le b.zipend
and a.TransDate ge b.StartDate
and a.transDate le b.EndDate
;
quit;
The easiest way to do this, as far as the query goes, is to add a correlated subquery that returns the minimum rate.
proc sql;
select t.transamount, t.transdate, t.transzip
,(select min(rate) from Key where t.transzip between ZipStart and ZipEnd and t.transdate between startdate and enddate) as Rate
from trans t;
quit;
You could also do it as subquery and join on it.
The SAS SQL Optimizer can be good sometimes. Other times, it can be a challenge. This code is going to be a bit more complicated, but it will likely be faster, and subject to size constraints on your key table.
data key;
set key;
dummy_key=1;
run;
data want(drop=dummy_key geocode rate startDate endDate zipStart zipEnd rc i);
if _n_ = 1 then do;
if 0 then set key;
declare hash k (dataset:'key',multidata:'y');
k.defineKey('dummy_key');
k.defineData('geocode','rate','startdate','enddate','zipstart','zipend');
k.defineDone();
end;
call missing (of _all_);
set trans;
dummy_key=1;
rc = k.find();
do i=1 to 1000 while (rc=0);
transZipNum = input(transZip,8.); *converts character zip to number. if its already a number then remove;
zipStartNum = input(zipStart,8.);
zipEndNum = input(zipEnd,8.);
if startDate <= transDate <= endDate then do;
if zipStartNum <= transZipNum <= zipEndNum then do;
rate_out = min(rate_out,rate);
end;
end;
rc=k.find_next();
end;
run;

SAS forward looking Moving standard deviation

Hi does anyone know how to calculate the standard deviation over the next four quarters for each quarter? Thanks :)
My attempt is below:
date1 is the sas date for the quarter in a year
Proc sql ; create table th.totalroll as
Select distinct permco, date1 ,
(select std(adjret) from th.returns1 where qtr between
intnx('quarter',qtr(date),0) and intnx('quarter', qtr(date),+3)) as
TOTALroll From th.returns1 group by permco ,date1;
QUIT;
It's hard to tell how close you are because I'm not entirely certain what your data looks like, but here's an example assuming you have more than one date in each quarter. Create sample data:
data have;
format date date9.;
do m = 1 to 128;
date = intnx('month','01JAN2008'd,m-1);
amount = round(ranuni(date)*10);
output;
end;
drop m;
run;
Using proc sql, create quarter variable (you might already have this variable?) and group by this variable. Use a having clause to restrict results to the first date of each quarter.
proc sql;
create table want as
select
yyq(year(t1.date),qtr(t1.date)) as quarter format=yyq.,
(select std(t2.amount)
from have t2
where t2.date >= yyq(year(t1.date),qtr(t1.date))
and t2.date < intnx('quarter',yyq(year(t1.date),qtr(t1.date)),4)) as stddev
from
have t1
group by
calculated quarter
having
t1.date = min(t1.date)
;
quit;
You should be able to adapt this to work for your data.
You can use proc expand if your dataset is already quarterly. So something like this:
proc expand data=th.returns1
out=th.totalroll
from=quarter
to=quarter;
by permco date1;
id date;
convert adjret=TOTALroll / transformout=( MOVSTD 4 );
run;
Don't forget to sort your data first. Note that MOVSTD gives you the backward moving standard deviation. You may need to shift the output series back by 4 quarters if you want the forward moving standard deviation.
Transformation Operations for proc expand:
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_expand_sect026.htm
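One way to get a forward-looking window without shifting the output by hand is to reverse the series, apply the backward MOVSTD, and reverse it again. This is a sketch only: it assumes one row per permco and quarter, that date1 is a SAS date identifying the quarter, and the names returns_sorted, totalroll_fwd and TOTALroll_fwd are made up.

```sas
/* Sketch: forward-looking 4-quarter standard deviation per permco. */
proc sort data=th.returns1 out=returns_sorted;
    by permco date1;
run;

proc expand data=returns_sorted out=totalroll_fwd method=none;
    by permco;
    id date1;
    /* REVERSE flips the series so that the backward-looking   */
    /* MOVSTD 4 covers the current and the next three quarters, */
    /* then the second REVERSE restores the original order.     */
    convert adjret = TOTALroll_fwd / transformout=(reverse movstd 4 reverse);
run;
```

METHOD=NONE keeps PROC EXPAND from interpolating the series before the transformation. The last few quarters of each permco have fewer than four future values, so their results are based on shorter windows.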