How to calculate quantile data for a table of frequencies in SAS?

I am interested in dividing my data into thirds, but I only have a summary table of counts by state. Specifically, I have estimated enrollment counts by state, and I would like to determine which states make up the top third of all enrollments. So, the top third should include a total cumulative percentage of at least .33333...
I have tried various ways of specifying cumulative percentages between .33333 and .40000, but with no success in handling the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
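length state $20;
input; /* load the raw line into _infile_ */
/* the last blank-delimited token is the count; everything before it is the state name */
call scan(_infile_, -1, pos, len, ' ');
enrollment = input(substr(_infile_, pos, len), 12.);
state = substr(_infile_, 1, pos - 1);
drop pos len;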
cards;
CALIFORNIA 440233
TEXAS 318921
NEW YORK 224867
FLORIDA 181517
ILLINOIS 162664
PENNSYLVANIA 155958
OHIO 141083
MICHIGAN 124051
NEW JERSEY 117131
GEORGIA 104351
NORTH CAROLINA 102466
VIRGINIA 93154
MASSACHUSETTS 80688
INDIANA 75784
WASHINGTON 73764
MISSOURI 73083
MARYLAND 73029
WISCONSIN 72443
TENNESSEE 71702
ARIZONA 69662
MINNESOTA 66470
COLORADO 58274
ALABAMA 54453
LOUISIANA 50344
KENTUCKY 49595
CONNECTICUT 47113
SOUTH CAROLINA 46155
OKLAHOMA 43428
OREGON 42039
IOWA 38229
UTAH 36476
KANSAS 36469
MISSISSIPPI 33085
ARKANSAS 32533
NEVADA 27545
NEBRASKA 24571
NEW MEXICO 22485
WEST VIRGINIA 21149
IDAHO 20596
NEW HAMPSHIRE 19121
MAINE 18213
HAWAII 16304
RHODE ISLAND 13802
DELAWARE 12025
MONTANA 11661
SOUTH DAKOTA 11111
VERMONT 10082
ALASKA 9770
NORTH DAKOTA 9614
WYOMING 7457
DIST OF COLUMBIA 6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...

You can compute the percentages with PROC FREQ and a WEIGHT statement, then flag the states in the first third using the LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;

It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);

Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
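Note: your first. syntax is incorrect. The first. and last. variables only exist for BY groups, so without a BY statement first.percent_total is never set. You still get the right values in CUM_PERCENT because of the cum_percent + percent_total; sum statement, which retains the running total.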
I am not aware of a PROC that will do this for you.
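On the first./last. point above, a minimal sketch of how those variables are normally used may help. It runs on the sashelp.class sample data rather than the enrollment table, and class_sorted, class_cum and cum_weight are made-up names for illustration only:

```sas
/* first./last. variables are created only when a BY statement is present, */
/* and the input must be sorted (or indexed) by the BY variables.          */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

data class_cum;
  set class_sorted;
  by sex;                              /* creates first.sex and last.sex */
  if first.sex then cum_weight = 0;    /* reset the running total for each group */
  cum_weight + weight;                 /* sum statement: retained and accumulated */
run;
```

In the enrollment example there is no grouping variable, so the reset is unnecessary and the sum statement alone gives the cumulative percentage.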

Related

SAS. calculate % from DISTINCT COUNT

I am working in SAS Studio Version: 2022.09.
I am working with survey data and will be tracking Region-Facility that has not submitted a survey in over 3 weeks. Surveys are voluntary but ideally facilities will submit a new survey weekly.
Region  Facility (Type&Name)        Date Survey Submitted
North   Hospital-Baptist Hospital   1/01/2023
South   PCP-Family Care             1/01/2023
North   PCP- Primary Medical        1/08/2023
South   PCP-Family Care             1/08/2023
North   Hospital-Baptist Hospital   1/15/2023
North   Hospital-St Mary Hospital   1/15/2023
West    Daycare-Early Learning      1/15/2023
West    Hospital-Methodist          1/15/2023
South   Daycare-Early Learning      1/15/2023
I want to obtain a list of facilities, by region, that have submitted before but have not submitted in the last 3 weeks. Since we do not expect to be successful with every facility, we will stop following a facility after 10 weeks.
Data have;
set want;
DaysDiff=intck('day', Date, today());
run;
proc sort data=have;
by Facility Region Date;
run;
data have;
set have;
by Facility;
if last.Facility;
run;
proc sort data=have
out=SurveysMissing;
BY Region Facility;
WHERE DaysDiff>21 AND DaysDiff<70;
run;
To assist in determining significance of losing facilities that had not submitted recently, I would like to obtain a %.
[Total # of facilities per REGION that have not submitted survey >21 <70] / [Total # of facilities per REGION that have reported in the last 10 weeks]
/* # facilities not submitted >21 AND <70 */
proc sql;
SELECT Count(Distinct Facility) AS Count, Region
FROM have
WHERE DaysDiff>21 AND DaysDiff <70
GROUP BY Region;
quit;
/*Count of Distinct Facilities per Region*/
proc sql;
SELECT Count(Distinct Facility) AS Count, Region
FROM have
WHERE DaysDiff <70
GROUP BY Region;
quit;
Would I need to create tables and do a left join to calculate %?
Thanks.
In Proc SQL a true condition resolves to 1 and false to 0. You can leverage this feature to compute the ratio of sums of expressions or binary flags.
Example:
Compute the ratio based on a subquery that flags facilities
proc sql;
create table want as
select
region, sum (isquiet_flag) / sum (submitted_flag) label = 'Fraction of quiet facilities'
from
( select region, facility
, min(today() - date_submitted ) > 21 as isquiet_flag
, min(today() - date_submitted ) < 70 as submitted_flag
from have
where today() - date_submitted < 70
group by region, facility
)
group by
region
;
quit;
In your last data step for have, add an indicator for a missing survey.
data have;
set have;
by Facility;
if last.Facility;
surveymissing = (daysdiff > 21); * contains 1 if condition is true, otherwise 0;
run;
Then use proc summary to compute your numerator and denominator for each region. The numerator is the sum of surveymissing while the denominator is the count of the same.
proc summary data=have nway;
where daysdiff < 70;
class region;
var surveymissing;
output out=region_summary (drop=_:) sum=SurveysMissing n=TotalFacilities;
run;
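The percentage itself is then one more step on top of that summary. The sketch below is a self-contained illustration: have_demo and its rows are invented stand-ins for the deduplicated have table, and region_pct and PctMissing are hypothetical names.

```sas
/* Hypothetical stand-in for HAVE after keeping one row per facility */
data have_demo;
  input region $ facility :$20. daysdiff;
  surveymissing = (daysdiff > 21);     /* 1 if no survey in the last 21 days, else 0 */
  datalines;
North Baptist 30
North StMary 5
South FamilyCare 40
South EarlyLearning 90
West Methodist 10
;
run;

proc summary data=have_demo nway;
  where daysdiff < 70;                 /* facilities past 10 weeks are no longer followed */
  class region;
  var surveymissing;
  output out=region_summary (drop=_:) sum=SurveysMissing n=TotalFacilities;
run;

/* Divide numerator by denominator to get the fraction per region */
data region_pct;
  set region_summary;
  PctMissing = SurveysMissing / TotalFacilities;
  format PctMissing percent8.1;
run;
```

With these made-up rows, North has 1 of 2 facilities missing, South 1 of 1 (the 90-day facility is filtered out by the WHERE statement), and West 0 of 1. No join is needed because both counts come out of the same PROC SUMMARY pass.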

SAS- Calculate Top Percent of Population

I am looking for some validation. This may be trivial for most, but I am by no means an expert in statistics. I am trying to select patients in the top 1% based on a score within each drug and location. The data would look something like this (on a much larger scale):
Patient drug place score
John a TX 12
Steven a TX 10
Jim B TX 9
Sara B TX 4
Tony B TX 2
Megan a OK 20
Tom a OK 10
Phil B OK 9
Karen B OK 2
The code snippet I have written to calculate those top 1% patients is as follows:
proc sql;
create table example as
select *,
score/avg(score) as test_measure
from prior_table
group by drug, place
having test_measure>.99;
quit;
Does this achieve what I am trying to do, or am I going about it all wrong? Sorry if this is really trivial to most.
Thanks
There are multiple ways to calculate and estimate a percentile. A simple way is to use PROC SUMMARY:
proc summary data=have;
var score;
output out=pct p99=p99;
run;
This will create a data set named pct with a variable p99 containing the 99th percentile.
Then filter your table for values >=p99
proc sql noprint;
create table want as
select a.*
from have as a
where a.score >= (select p99 from pct);
quit;
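The code above computes a single 99th percentile over the whole table, whereas the question asks for the top 1% within each drug and location. One possible extension, offered as a sketch rather than a definitive answer, is to add a CLASS statement and join the per-group percentiles back. The have data set below simply re-enters the sample rows from the question; pct_by_group and want_by_group are hypothetical names.

```sas
/* Sample rows copied from the question */
data have;
  input patient $ drug $ place $ score;
  datalines;
John a TX 12
Steven a TX 10
Jim B TX 9
Sara B TX 4
Tony B TX 2
Megan a OK 20
Tom a OK 10
Phil B OK 9
Karen B OK 2
;
run;

/* One 99th percentile per drug/place combination */
proc summary data=have nway;
  class drug place;
  var score;
  output out=pct_by_group (drop=_type_ _freq_) p99=p99;
run;

/* Keep patients at or above the percentile of their own group */
proc sql;
  create table want_by_group as
  select a.*, b.p99
  from have as a
       inner join pct_by_group as b
       on a.drug = b.drug and a.place = b.place
  where a.score >= b.p99;
quit;
```

With groups this small the 99th percentile is effectively the group maximum, so the example only becomes meaningful on data with many patients per group.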

How to delete all the duplicate observations but add a column with the frequency in SAS?

In a dataset in SAS, I have some observations multiple times. What I am trying to do is: I am trying to add a column with the frequency of each observation and make sure I keep it only one time in my dataset. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
@kl78 is right, proc summary is the best non-sql solution here. This runs in memory, which can cause problems with very large datasets, but you should be ok with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
input name $ id address &$ age; /* & modifier: address must be followed by two or more blanks */
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
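If memory does become a concern, a sort followed by BY-group counting in a data step is another non-SQL option. This is only a sketch under the assumption that the four columns from the question are the full key; have_sorted and want_sorted are hypothetical names, and the sample rows are re-entered comma-delimited so the embedded blank in "new york" is unambiguous.

```sas
data have;
  infile datalines dlm=',';
  input name $ id address :$20. age;
  datalines;
jack,2,chicago,50
peter,4,new york,45
jack,2,chicago,50
;
run;

/* Bring identical rows next to each other */
proc sort data=have out=have_sorted;
  by name id address age;
run;

data want_sorted;
  set have_sorted;
  by name id address age;
  if first.age then frequency = 0;     /* first.age marks the start of each distinct row */
  frequency + 1;                       /* count the rows in the group */
  if last.age then output;             /* write one row per distinct combination */
run;
```

Because age is the last BY variable, first.age and last.age change whenever any of the four variables changes, so each distinct row is counted once. For the sample this gives frequency 2 for jack and 1 for peter.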

SAS Recursive Join

I have a large table of connections, and would like to expand that table to include recursive connections.
My data looks like this --
data city_list;
input from_city $ to_city $;
datalines;
PORTLAND SEATTLE
SEATTLE BOISE
BOISE PORTLAND
PORTLAND HELENA
NYC ORLANDO
ORLANDO MIAMI
;
run;
I'd like to expand the data set to include stopovers, so it ends up looking like this. I'm not concerned about whether I have both a "PORTLAND/SEATTLE" and a "SEATTLE/PORTLAND" record -- I can handle those afterwards as necessary.
BOISE HELENA
BOISE PORTLAND
BOISE SEATTLE
NYC MIAMI
NYC ORLANDO
ORLANDO MIAMI
PORTLAND HELENA
PORTLAND SEATTLE
SEATTLE HELENA
I've tried using the following macro, but ran into performance problems when there were too many levels of recursion. I believe the best option would be hash tables, but am not sure how to code this precise scenario.
data city_list;
input from_city $ to_city $;
datalines;
PORTLAND SEATTLE
SEATTLE BOISE
BOISE PORTLAND
PORTLAND HELENA
NYC ORLANDO
ORLANDO MIAMI
;
run;
%macro RecurJoin(
baseTbl,
destTbl,
baseKey,
compKey
);
Proc SQL;
Create Table WORK.RECUR_JOIN_TBL as
SELECT distinct Base.&baseKey, Connect.&compkey
FROM &baseTbl AS Base
INNER JOIN &baseTbl AS Connect
ON (Base.&compkey = Connect.&baseKey)
LEFT JOIN &baseTbl AS Subbase
ON (Base.&baseKey = Subbase.&baseKey) AND
(Connect.&compkey = Subbase.&compkey)
WHERE Subbase.&baseKey IS NULL;
quit;
proc sql noprint;
select count(1) into :connectCnt from RECUR_JOIN_TBL;
quit;
Data &destTbl;
set &baseTbl
RECUR_JOIN_TBL;
run;
Proc DataSets nolist;
Delete RECUR_JOIN_TBL;
Quit;
%if &connectCnt > 0 %then %do;
%RecurJoin(
baseTbl=&destTbl,
destTbl=&destTbl,
baseKey=&baseKey,
compKey=&compKey
);
%end;
%mend;
%RecurJoin(
baseTbl=city_list,
destTbl=FNL_CITY_LIST,
baseKey=from_city,
compKey=to_city
);
Proc Sort data=WORK.FNL_CITY_LIST (where=(NOT(from_city=to_city)));
by from_city to_city;
run;
Memory allowing, you can use the hash-based approach I came up with in this answer to identify the groups of connected cities within your dataset. Then you just need to generate a row for every pair of cities within the same group, which can easily be done via a cartesian join in proc sql.
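The cartesian-join step might look roughly like the sketch below. It assumes the grouping step has already produced one row per city with a group identifier; the city_groups data set, its group_id values and the city_pairs output name are all hypothetical and are entered by hand here just to make the example runnable.

```sas
/* Hypothetical output of the grouping step: one row per city */
data city_groups;
  input city $ group_id;
  datalines;
PORTLAND 1
SEATTLE 1
BOISE 1
HELENA 1
NYC 2
ORLANDO 2
MIAMI 2
;
run;

/* Pair every city with every other city in the same group */
proc sql;
  create table city_pairs as
  select a.city as from_city, b.city as to_city
  from city_groups as a, city_groups as b
  where a.group_id = b.group_id
    and a.city < b.city                /* keep each unordered pair once, drop self-pairs */
  order by from_city, to_city;
quit;
```

The a.city < b.city condition is a design choice: it yields 9 rows for this sample, one per unordered pair, with the alphabetically earlier city always in from_city. Replace it with a.city ne b.city if both directions are wanted.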

PROC UNIVARIATE: Output trimmed mean to dataset by id

Following up on the question about outputting the trimmed mean from PROC UNIVARIATE to a table:
SAS: PROC UNIVARIATE: Output trimmed mean to dataset
I would like to output a trimmed mean from PROC UNIVARIATE by group. However, the ODS output does not seem to work with noprint, and there are just too many group ids for the printed output to be workable. Is there any workaround for this problem?
proc univariate data = Table1 idout trim=1;
var DaysBtwPay;
by id;
ods output trimmedmeans = trimMean2 (keep = id Mean stdMean);
run;
Doh! It appears you can use the ods _all_ close; statement to suppress the HTML output instead of going through the trouble of writing your own data step routine.
%let trimmed = 1;
proc sort data=sashelp.shoes out=have;
by region;
run;
ods _all_ close;
PROC UNIVARIATE DATA=have trimmed=&trimmed. ;
VAR returns;
by region;
ods output TrimmedMeans=trimmedMeansUni ;
run;
I can't think of any way around this issue other than to write your own data step to calculate the trimmed mean.
This can be done in 2 steps.
Step-1:
In this step we want to know how many observations there are in each by-group and the simple average of your measurement variable. In the next step, the simple average is returned when calculating the trimmed mean is infeasible. For example, if you want to exclude the 5 most extreme obs from each tail but you only have 7 observations in a by-group, then proc univariate returns a missing value.
Notice the order by clause - this ordering is used in excluding extreme obs.
proc sql;
create table inputForTmeans as
select
a.region /*your by-group var*/
,a.returns /*your measurement variable of interest*/
,b.count
,b.simpleAvgReturns
from sashelp.shoes as a
inner join (select region, count(*) as count, mean(returns) as simpleAvgReturns
from sashelp.shoes
group by region) as b
on a.region = b.region
order by
a.region
,a.returns;
quit;
Step-2:
%let trimmed = 1; /*no. of extreme obs to exclude from mean calculation*/
data trimmedMean;
set inputForTmeans;
row_count+1; /*counter variable to number each obs in a by-group*/
by region returns;
if first.region then do;
row_count=1;
returnsSum=.;
end;
if &trimmed.<row_count <=(count - &trimmed.) then returnsSum+returns;
/***************************************************************************/
if last.region then do;
trimmedMeanreturns = coalesce(returnsSum/(count - 2*&trimmed.), simpleAvgReturns) ;
N = row_count;
trimmedRowCount = 2*&trimmed.;
output;
end;
keep region trimmedMeanreturns N count trimmedRowCount ;
/***************************************************************************/
run;
Output:
%let trimmed = 1;
region DataStep ProcUnivariate
Africa 1183.962963 1183.963
Asia 662 662
Canada 3089.1428571 3089.143
Central America/Caribbean 3561.1333333 3561.133
Eastern Europe 2665.137931 2665.138
Middle East 6794.3181818 6794.318
Pacific 1538.5813953 1538.581
South America 1824.1153846 1824.115
United States 4462.4210526 4462.421
Western Europe 2538.05 2538.05
%let trimmed = 14;
region DataStep ProcUnivariate
Africa 897.92857143 897.9
Asia 778.21428571 .
Canada 1098.1111111 1098.1
Central America/Caribbean 2289.25 2289.3
Eastern Europe 2559.6666667 2559.7
Middle East 8620 .
Pacific 895.88235294 895.9
South America 1538.5769231 1538.6
United States 4010.4166667 4010.4
Western Europe 1968.5882353 1968.6
Output from the datastep for trimmed=1:
Count: No. of rows in the by-group
N: Ignore this column - same as Count
trimmedRowCount: no. of extreme rows excluded. If trimmedRowCount = Count then trimmedMeanreturns is the SimpleAverage
Region count trimmedMeanreturns N trimmedRowCount
Africa 56 1183.963 56 2
Asia 14 662 14 2
Canada 37 3089.143 37 2
Central America/Caribbean 32 3561.133 32 2
Eastern Europe 31 2665.138 31 2
Middle East 24 6794.318 24 2
Pacific 45 1538.581 45 2
South America 54 1824.115 54 2
United States 40 4462.421 40 2
Western Europe 62 2538.05 62 2