PROC UNIVARIATE: Output trimmed mean to dataset by id - sas

Following up on an earlier question about writing the trimmed mean from PROC UNIVARIATE to a table:
SAS: PROC UNIVARIATE: Output trimmed mean to dataset
I would like to output a trimmed mean from PROC UNIVARIATE by group. However, ODS OUTPUT does not seem to work with NOPRINT, and there are too many group IDs to let the procedure print everything. Is there a workaround?
proc univariate data = Table1 idout trim=1;
var DaysBtwPay;
by id;
trimmedmeans = trimMean2 (keep = id Mean stdMean);
run;

It turns out you can use the ods _all_ close; statement to suppress the HTML output instead of going to the trouble of writing your own data step routine.
%let trimmed = 1;
proc sort data=sashelp.shoes out=have;
by region;
run;
ods _all_ close;
PROC UNIVARIATE DATA=have trimmed=&trimmed. ;
VAR returns;
by region;
ods output TrimmedMeans=trimmedMeansUni ;
run;
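A commonly used alternative to closing every destination is to leave the destinations open and exclude all displayed output for the duration of the step. The sketch below assumes the same sashelp.shoes example and a trim count of 1; it has not been run against the original poster's data.

```sas
proc sort data=sashelp.shoes out=have;
  by region;
run;

ods exclude all;                 /* suppress displayed output; destinations stay open */
proc univariate data=have trimmed=1;
  by region;
  var returns;
  ods output TrimmedMeans=trimmedMeansUni;   /* ODS OUTPUT still captures the table */
run;
ods exclude none;                /* restore normal output for later steps */
```

Because nothing is closed, there is no need to reopen the HTML or listing destination afterwards.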

I can't think of any way around this issue other than writing your own data step to calculate the trimmed mean.
This can be done in 2 steps.
Step-1:
In this step we find how many observations there are in each by-group and the simple average of the measurement variable. In the next step, the simple average is returned when calculating the trimmed mean is infeasible. For example, if you want to exclude the 5 most extreme obs from each tail but have only 7 observations in a by-group, then proc univariate returns a missing value.
Notice the order by clause: this ordering is used to exclude the extreme obs.
proc sql;
create table inputForTmeans as
select
a.region /*your by-group var*/
,a.returns /*your measurement variable of interest*/
,b.count
,b.simpleAvgReturns
from sashelp.shoes as a
inner join (select region, count(*) as count, mean(returns) as simpleAvgReturns
from sashelp.shoes
group by region) as b
on a.region = b.region
order by
a.region
,a.returns;
quit;
Step-2:
%let trimmed = 1; /*no. of extreme obs to exclude from mean calculation*/
data trimmedMean;
set inputForTmeans;
row_count+1; /*counter variable to number each obs in a by-group*/
by region returns;
if first.region then do;
row_count=1;
returnsSum=.;
end;
if &trimmed.<row_count <=(count - &trimmed.) then returnsSum+returns;
/***************************************************************************/
if last.region then do;
trimmedMeanreturns = coalesce(returnsSum/(count - 2*&trimmed.), simpleAvgReturns) ;
N = row_count;
trimmedRowCount = 2*&trimmed.;
output;
end;
keep region trimmedMeanreturns N count trimmedRowCount ;
/***************************************************************************/
run;
Output:
%let trimmed = 1;
region                      DataStep       ProcUnivariate
Africa                      1183.962963    1183.963
Asia                        662            662
Canada                      3089.1428571   3089.143
Central America/Caribbean   3561.1333333   3561.133
Eastern Europe              2665.137931    2665.138
Middle East                 6794.3181818   6794.318
Pacific                     1538.5813953   1538.581
South America               1824.1153846   1824.115
United States               4462.4210526   4462.421
Western Europe              2538.05        2538.05
%let trimmed = 14;
region                      DataStep       ProcUnivariate
Africa                      897.92857143   897.9
Asia                        778.21428571   .
Canada                      1098.1111111   1098.1
Central America/Caribbean   2289.25        2289.3
Eastern Europe              2559.6666667   2559.7
Middle East                 8620           .
Pacific                     895.88235294   895.9
South America               1538.5769231   1538.6
United States               4010.4166667   4010.4
Western Europe              1968.5882353   1968.6
Output from the datastep for trimmed=1:
Count: No. of rows in the by-group
N: Ignore this column - same as Count
trimmedRowCount: no. of extreme rows excluded (2 x trimmed). If trimmedRowCount >= Count, then trimmedMeanreturns is the simple average
Region                      count   trimmedMeanreturns   N    trimmedRowCount
Africa                      56      1183.963             56   2
Asia                        14      662                  14   2
Canada                      37      3089.143             37   2
Central America/Caribbean   32      3561.133             32   2
Eastern Europe              31      2665.138             31   2
Middle East                 24      6794.318             24   2
Pacific                     45      1538.581             45   2
South America               54      1824.115             54   2
United States               40      4462.421             40   2
Western Europe              62      2538.05              62   2

Related

SAS. calculate % from DISTINCT COUNT

I am working in SAS Studio Version: 2022.09.
I am working with survey data and will be tracking Region-Facility that has not submitted a survey in over 3 weeks. Surveys are voluntary but ideally facilities will submit a new survey weekly.
Region   Facility (Type&Name)        Date Survey Submitted
North    Hospital-Baptist Hospital   1/01/2023
South    PCP-Family Care             1/01/2023
North    PCP- Primary Medical        1/08/2023
South    PCP-Family Care             1/08/2023
North    Hospital-Baptist Hospital   1/15/2023
North    Hospital-St Mary Hospital   1/15/2023
West     Daycare-Early Learning      1/15/2023
West     Hospital-Methodist          1/15/2023
South    Daycare-Early Learning      1/15/2023
I want to obtain a list of facilities by region that have submitted before but have not submitted in the last 3 weeks. Since we do not expect to be successful with every facility, we will stop following facilities after 10 weeks.
Data have;
set want;
DaysDiff=intck('day', Date, today());
run;
proc sort data=have;
by Facility Region Date;
run;
data have;
set have;
by Facility;
if last.Facility;
run;
proc sort data=have
out=SurveysMissing;
BY Region Facility;
WHERE DaysDiff>21 AND DaysDiff<70;
run;
To assist in determining significance of losing facilities that had not submitted recently, I would like to obtain a %.
[Total # of facilities per REGION that have not submitted survey >21 <70] / [Total # of facilities per REGION that have reported in the last 10 weeks]
/*#facilities not submitted >21 AND <70 */
proc sql;
SELECT Count(Distinct Facility) AS Count, Region
FROM have
WHERE DaysDiff>21 AND DaysDiff <70
GROUP BY Region;
quit;
/*Count of Distinct Facilities per Region*/
proc sql;
SELECT Count(Distinct Facility) AS Count, Region
FROM have
WHERE DaysDiff <70
GROUP BY Region;
quit;
Would I need to create tables and do a left join to calculate %?
Thanks.
In Proc SQL a true condition resolves to 1 and false to 0. You can leverage this feature to compute the ratio of sums of expressions or binary flags.
Example:
Compute the ratio based on a subquery that flags facilities
proc sql;
create table want as
select
region, sum (isquiet_flag) / sum (submitted_flag) label = 'Fraction of quiet facilities'
from
( select region, facility
, min(today() - date_submitted ) > 21 as isquiet_flag
, min(today() - date_submitted ) < 70 as submitted_flag
from have
where today() - date_submitted < 70
group by region, facility
)
group by
region
;
quit;
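To see the 1/0 behaviour in isolation, here is a minimal sketch against sashelp.shoes. The threshold of 1000 and the output column names are illustrative assumptions, not part of the question.

```sas
proc sql;
  select region,
         sum(returns > 1000)            as n_large,    /* each true comparison adds 1 */
         count(*)                       as n_total,
         sum(returns > 1000) / count(*) as frac_large  /* ratio of flagged rows to all rows */
  from sashelp.shoes
  group by region;
quit;
```

The same pattern gives the requested percentage once the comparison is replaced by the "quiet facility" condition.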
In your last data step for have, add an indicator for a missing survey.
data have;
set have;
by Facility;
if last.Facility;
surveymissing = (daysdiff > 21); * contains 1 if condition is true, otherwise 0;
run;
Then use proc summary to compute the numerator and denominator for each region. The numerator is the sum of surveymissing, while the denominator is the count of the same.
proc summary data=have nway;
where daysdiff < 70;
class region;
var surveymissing;
output out=region_summary (drop=_:) sum=SurveysMissing n=TotalFacilities;
run;
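The requested percentage is then one division per region. A minimal sketch, assuming the region_summary dataset produced above; the names region_pct and PctMissing are hypothetical:

```sas
data region_pct;
  set region_summary;
  /* guard against an empty denominator before dividing */
  if TotalFacilities > 0 then PctMissing = SurveysMissing / TotalFacilities;
  format PctMissing percent8.1;   /* display 0.25 as 25.0% */
run;
```

Since surveymissing is a 0/1 flag, requesting mean= in the proc summary output statement would give the same fraction directly.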

How can I compare many datasets update several columns based on the max value of a single column in SAS?

I have test scores from many students in 8 different years. I want to retain only the max total score of each student, but then also retain all the student-year related information to that test score (that is, all the columns from the same year in which the student got the highest total score).
An example of the datasets I have:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
output;
end;
%end;
run;
%mend;
%score;
In my expected output, I would like to retain the max of total_score for each student, and also have the other columns related to that total score. If possible, I would also like to have the information about the year in which the student got the max of total_score. An example of the expected output would be:
DATA want;
INPUT id total_score english math sciences history year;
CARDS;
1 75.4 15.4 20 20 20 2017
2 63.8 20 13.8 10 20 2016
3 48 10 10 18 10 2018
4 52 12 10 10 20 2016
5 69.5 20 19.5 20 10 2013
6 85 20.5 20.5 21 23 2011
7 41 5 12 14 10 2010
8 55.3 15 20.3 10 10 2012
9 51.5 10 20 10 11.5 2013
10 48.9 12.9 16 10 10 2015
;
RUN;
I have been trying to work with the SAS UPDATE statement, but it just keeps the most recent value for each student, whereas I want the max total score. Also, within the update framework, I can only update two tables at a time, and I would like to compare all tables at once. So the strategy I am trying does not work:
data want;
update score_2010 score_2011;
by id;
Thanks to anyone who can provide insights.
It is easier to obtain what you want if you have only one longitudinal dataset with all the original information of your students. It also makes more sense, since you are comparing students across different years.
To build a longitudinal dataset, you will first need to insert a variable informing the year of each of your original datasets. For example with:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
year=&year.;
output;
end;
%end;
run;
%mend;
%score;
After including the year, you can get a longitudinal dataset with:
data student_allyears;
set student_201:;
run;
Finally, you can get what you want with a proc sql, in which you select the max of "total_score" grouped by "id":
proc sql;
create table want as
select distinct *
from student_allyears
group by id
having total_score=max(total_score);
quit;
Create a view that stacks the individual data sets and perform your processing on that.
Example (SQL select, group by, and having)
data scores / view=scores;
length year $4;
set work.student_2010-work.student_2018 indsname=dsname;
year = scan(dsname,-1,'_');
run;
proc sql;
create table want as
select * from scores
group by id
having total_score=max(total_score)
;
quit;
Example DOW loop processing
Stack the data so that the view can be processed BY ID. The first DOW loop finds which record has the max total score within the group, and the second selects that record in the group for OUTPUT.
data scores_by_id / view=scores_by_id;
set work.student_2010-work.student_2018 indsname=dsname;
by id;
year = scan(dsname,-1,'_');
run;
data want;
* compute which record in group has max measure;
do _n_ = 1 by 1 until (last.id);
set scores_by_id;
by id;
if total_score > _max then do;
_max = total_score;
_max_at_n = _n_;
end;
end;
* output entire record having the max measure;
do _n_ = 1 to _n_;
set scores_by_id;
if _n_ = _max_at_n then OUTPUT;
end;
drop _max:;
run;
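For comparison, the same result can be sketched with a different and simpler technique than the DOW loop: sort each student's records by descending score and keep the first one. This assumes the student_2010 to student_2018 datasets from the question; when a student has tied maximum scores, only one of the tied records is kept.

```sas
data student_allyears;
  set work.student_2010-work.student_2018 indsname=dsname;
  year = input(scan(dsname, -1, '_'), 4.);   /* numeric year taken from the dataset name */
run;

proc sort data=student_allyears;
  by id descending total_score;              /* best score comes first within each id */
run;

data want;
  set student_allyears;
  by id;
  if first.id;                               /* keep only the top record per student */
run;
```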

SAS compare two tables with different column order

Hi, I have two tables with different column orders, and the column names are not capitalized the same way. How can I check whether the contents of these two tables are the same?
For example, I have two tables of students' grades
table A:
Math English History
-------+--------+---------
Tim 98 95 90
Helen 100 92 85
table B:
history MATH english
--------+--------+---------
Tim 90 98 95
Helen 85 100 92
You may use either of the two approaches to compare, regardless of the order or column name
/*1. Proc compare*/
proc sort data=A; by name; run;
proc sort data=B; by name; run;
proc compare base=A compare=B;
id name;
run;
/*2. Proc SQL*/
proc sql;
select Math, English, History from A
<union/ intersect/ Except>
select MATH, english, history from B;
quit;
Use except corr (corresponding), which matches columns by name rather than by position. If everything matches, you will get zero records.
data have1;
input Math English History;
datalines;
1 2 3
;
run;
data have2;
input English math History;
datalines;
2 1 3
;
run;
proc sql ;
select * from have1
except corr
select * from have2;
quit;
Edit:
If you want to check which particular column differs, you may have to transpose and compare, as in the example below.
data have1;
input name $ Math English pyschology History;
datalines;
Tim 98 95 76 90
Helen 100 92 55 85
;
run;
data have2;
input name $ English Math pyschology History;
datalines;
Tim 95 98 76 90
Helen 92 100 99 85
;
run;
proc sort data = have1 out =hav1;
by name;
run;
proc sort data = have2 out =hav2;
by name;
run;
proc transpose data =hav1 out=newhave1 (rename = (_name_= subject
col1=marks));
by name;
run;
proc transpose data =hav2 out=newhave2 (rename = (_name_= subject
col1=marks));
by name;
run;
proc sql;
create table want(drop=mark_dif) as
select
a.name as name
,a.subject as subject
,a.marks as have1_marks
,b.marks as have2_marks
,a.marks -b.marks as mark_dif
from newhave1 a inner join newhave2 b
on upcase(a.name) = upcase(b.name)
and upcase(a.subject) =upcase(b.subject)
where calculated mark_dif ne 0;
quit;

Multiple transactions lines to base table SAS

I am new to SAS and am trying to handle some customer data, and I'm not really sure how to do this.
What I have:
data transactions;
input ID $ Week Segment $ Average Freq;
datalines;
1 1 Sports 500 2
1 1 PC 400 3
1 2 Sports 350 3
1 2 PC 550 3
2 1 Sports 650 2
2 1 PC 700 3
2 2 Sports 720 3
2 2 PC 250 3
;
run;
What I want:
data transactions2;
input ID Week1_Sports_Average Week1_PC_Average Week1_Sports_Freq
Week1_PC_Freq
Week2_Sports_Average Week2_PC_Average Week2_Sports_Freq Week2_PC_Freq;
datalines;
1 500 400 2 3 350 550 3 3
2 650 700 2 3 720 250 3 3
;
run;
The only thing I got so far is this:
Data transactions3;
SET transactions;
if week=1 and Segment="Sports" then DO;
Week1_Sports_Freq=Freq;
Week1_Sports_Average=Average;
END;
else DO;
Week1_Sports_Freq=0;
Week1_Sports_Average=0;
END;
run;
This will be way too much work, as I have a lot of weeks and more variables than just freq/avg.
I'm really hoping for some tips, as I'm stuck.
You can use PROC TRANSPOSE to create that structure. But you need to use it twice since your original dataset is not fully normalized.
The first PROC TRANSPOSE will get the AVERAGE and FREQ readings onto separate rows.
proc transpose data=transactions out=tall ;
by id week segment notsorted;
var average freq ;
run;
If you don't mind having the variables named slightly differently than in your proposed solution you can just use another proc transpose to create one observation per ID.
proc transpose data=tall out=want delim=_;
by id;
id segment _name_ week ;
var col1 ;
run;
If you want the exact names you had before, you could add a data step that first creates a variable to use in the ID statement of the PROC TRANSPOSE.
data tall ;
set tall ;
length new_name $32 ;
new_name = catx('_',cats('WEEK',week),segment,_name_);
run;
proc transpose data=tall out=want ;
by id;
id new_name;
var col1 ;
run;
Note that it is easier in SAS when you have a numbered series of variable if the number appears at the end of the name. Then you can use a variable list. So instead of WEEK1_AVERAGE, WEEK2_AVERAGE, ... you would use WEEK_AVERAGE_1, WEEK_AVERAGE_2, ... So that you could use a variable list like WEEK_AVERAGE_1 - WEEK_AVERAGE_5 in your SAS code.
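As a small illustration of that last point, the sketch below uses hypothetical variables week_average_1 to week_average_3 with made-up values, and refers to them through a numbered range list both when reading and when summing.

```sas
data example;
  /* a numbered range list reads all three variables in one go */
  input id week_average_1-week_average_3;
  /* the same list works inside functions via OF */
  total_average = sum(of week_average_1-week_average_3);
  datalines;
1 500 350 420
2 650 720 610
;
run;
```

With names like WEEK1_AVERAGE the number sits in the middle of the name, so this kind of range list is not available.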

How to calculate quantile data for table of frequencies in SAS?

I am interested in dividing my data into thirds, but I only have a summary table of counts by a state. Specifically, I have estimated enrollment counts by state, and I would like to calculate what states comprise the top third of all enrollments. So, the top third should include at least a total cumulative percentage of .33333...
I have tried various means of specifying cumulative percentages between .33333 and .40000 but with no success in specifying the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
input state $20. enrollment;
cards;
CALIFORNIA 440233
TEXAS 318921
NEW YORK 224867
FLORIDA 181517
ILLINOIS 162664
PENNSYLVANIA 155958
OHIO 141083
MICHIGAN 124051
NEW JERSEY 117131
GEORGIA 104351
NORTH CAROLINA 102466
VIRGINIA 93154
MASSACHUSETTS 80688
INDIANA 75784
WASHINGTON 73764
MISSOURI 73083
MARYLAND 73029
WISCONSIN 72443
TENNESSEE 71702
ARIZONA 69662
MINNESOTA 66470
COLORADO 58274
ALABAMA 54453
LOUISIANA 50344
KENTUCKY 49595
CONNECTICUT 47113
SOUTH CAROLINA 46155
OKLAHOMA 43428
OREGON 42039
IOWA 38229
UTAH 36476
KANSAS 36469
MISSISSIPPI 33085
ARKANSAS 32533
NEVADA 27545
NEBRASKA 24571
NEW MEXICO 22485
WEST VIRGINIA 21149
IDAHO 20596
NEW HAMPSHIRE 19121
MAINE 18213
HAWAII 16304
RHODE ISLAND 13802
DELAWARE 12025
MONTANA 11661
SOUTH DAKOTA 11111
VERMONT 10082
ALASKA 9770
NORTH DAKOTA 9614
WYOMING 7457
DIST OF COLUMBIA 6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
You can easily count percentages using PROC FREQ with WEIGHT statement and then select those in the first third using LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
Note: your first. syntax is incorrect. first. and last. variables only exist when the data step has a BY statement. You still get the right values in CUM_PERCENT because of the cum_percent + percent_total; sum statement, which retains the running total across rows.
I am not aware of a PROC that will do this for you.
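To illustrate the note about first. and last., here is a minimal sketch of a running total that restarts for each BY group. It uses sashelp.shoes rather than the enrollment data, and the names shoes_sorted, cum_by_region and cum_sales are hypothetical.

```sas
proc sort data=sashelp.shoes out=shoes_sorted;
  by region;                            /* BY processing needs sorted input */
run;

data cum_by_region;
  set shoes_sorted;
  by region;                            /* this statement is what creates first.region */
  if first.region then cum_sales = 0;   /* reset at the start of each group */
  cum_sales + sales;                    /* sum statement: retained running total */
run;
```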