Use Proc Summary Statistic in Another Data Step - sas

I have a dataset that contains a list of employees and the number of hours they work.
Name Hours_CD_Max
Bob 455
Dan 675
Jane 543
Suzzy 575
Emily 234
I used a proc summary step to calculate the total number of hours worked by these employees.
Proc summary data = staff;
where position = 'PA_FT_UMC';
var Hours_CD_Max;
output out=PAFT_Only_Staff_Totals
sum = Hours_CD_Max_tot;
run;
I would like to take the 'Hours_CD_Max_tot' statistic (2482) from this proc summary step and apply it elsewhere in my code where I need to make calculations.
For example, I want to take each provider's current hours and divide them by Hours_CD_Max_tot.
So for example, create a dataset that looks like this
data PA_FT_UMC_StaffingV2;
set PA_FT_UMC_StaffingV1 PAFT_Only_Staff_Totals;
if position = 'PA_FT_UMC';
PA_FT_Max_Percent= Hours_CD_Max/Hours_CD_Max_tot;
run;
Name Hours_CD_Max Hours_CD_Max_tot
Bob 455 2482
Dan 675 2482
Jane 543 2482
Suzzy 575 2482
Emily 234 2482
I realize I can calculate the % of hours worked using a proc freq but I really would like to have that (2482) number to be freely available so I can plug it into more complex equations.

If you want to use a variable from another dataset then pull it into your current code.
If there are BY variables, you can merge it onto your other dataset. In this case, where you have just one observation, run a SET statement on the first iteration of the data step.
data PA_FT_UMC_StaffingV2;
set PA_FT_UMC_StaffingV1;
if _n_=1 then set PAFT_Only_Staff_Totals(keep=Hours_CD_Max_tot);
if position = 'PA_FT_UMC';
PA_FT_Max_Percent= Hours_CD_Max/Hours_CD_Max_tot;
run;
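For the BY-variable case mentioned above, a minimal sketch might look like the following. It assumes the staff dataset from the question contains both position and Hours_CD_Max; the names position_totals, staff_sorted, staff_pct and Position_Percent are hypothetical and used only for illustration.

```sas
/* One total per position; NWAY keeps only the per-position rows */
proc summary data=staff nway;
  class position;
  var Hours_CD_Max;
  output out=position_totals(drop=_type_ _freq_) sum=Hours_CD_Max_tot;
run;

/* MERGE with BY needs both inputs ordered by position */
proc sort data=staff out=staff_sorted;
  by position;
run;

data staff_pct;
  merge staff_sorted(in=in_staff) position_totals;
  by position;
  if in_staff;   /* keep only rows that came from the detail data */
  Position_Percent = Hours_CD_Max / Hours_CD_Max_tot;
run;
```

Each detail row then carries the total for its own position, so the same pattern should extend to any number of groups.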

Another option is to store the total in a macro variable, and PROC SQL is a quick way to do that.
proc sql noprint;
select sum(hours_cd_max) into :hours_max
from have;
quit;
Then you can use it in your calculations later on by referencing &hours_max.
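As a rough sketch of that later use, assuming the macro variable hours_max already exists and that PA_FT_UMC_StaffingV1 (the dataset name from the question) contains position and Hours_CD_Max:

```sas
/* Sketch only: &hours_max is replaced by the stored text (e.g. 2482)
   before the data step is compiled */
data PA_FT_UMC_StaffingV2;
  set PA_FT_UMC_StaffingV1;
  where position = 'PA_FT_UMC';
  PA_FT_Max_Percent = Hours_CD_Max / &hours_max;
run;
```

Keep in mind that a macro variable holds text, so a total with many significant digits may be rounded when PROC SQL writes it out. For whole-number totals like the one here that should not matter.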

Related

enter column in a dataset to an array

I have 33 different datasets with one column and all share the same column name/variable name;
net_worth
I want to load the values into arrays and use them in a datastep, but the array that I use should depend on the by groups in the datastep (country by city). There are a total of 33 datasets and 33 groups (country by city). Each dataset corresponds to exactly one by group.
Here is an example of what the by groups look like in the dataset customers:
UK 105 (other fields)
UK 102 (other fields)
US 291 (other fields)
US 292 (other fields)
Could I get some advice on how to load the columns into arrays and then use them in a datastep? Or do you suggest doing it another way?
%let var1 = uk105;
%let var2 = uk102;
.....
%let var33 = jk12;
data want;
set customers;
by country city;
if _n_ = 1 then do;
*set datasets and create and populate arrays*;
* use array values in calculations with fields from dataset customers, depending on which by group. if the by group is uk and city is 105 then i need to use the created array corresponding to that by group;
It is a little hard to understand what you want.
It sounds like you have one dataset named CUSTOMERS that has all of the main variables, and a bunch of single-variable datasets that hold the values of NET_WORTH for a lot of different things (countries?).
Assuming that the observations in all of the datasets are in the same order then I think you are asking for how to generate a data step like this:
data want;
set customers;
set uk105 (rename=(net_worth=uk105));
set uk102 (rename=(net_worth=uk102));
....
run;
It might be easiest to generate that code with a data step and then %INCLUDE it.
filename code temp;
data _null_;
input name $32. ;
file code ;
put ' set ' name '(rename=(net_worth=' name '));' ;
cards;
uk105
uk102
;;;;
data want;
set customers;
%include code / source2;
run;

SAS- Calculate Top Percent of Population

I am trying to seek some validation, this may be trivial for most but I am by no means an expert at statistics. I am trying to select patients in the top 1% based on a score within each drug and location. The data would look something like this (on a much larger scale):
Patient drug place score
John a TX 12
Steven a TX 10
Jim B TX 9
Sara B TX 4
Tony B TX 2
Megan a OK 20
Tom a OK 10
Phil B OK 9
Karen B OK 2
The code snippet I have written to calculate those top 1% patients is as follows:
proc sql;
create table example as
select *,
score/avg(score) as test_measure
from prior_table
group by drug, place
having test_measure>.99;
quit;
Does this achieve what I am trying to do, or am I going about it all wrong? Sorry if this is really trivial to most.
Thanks
There are multiple ways to calculate and estimate a percentile. A simple way is to use PROC SUMMARY
proc summary data=have;
var score;
output out=pct p99=p99;
run;
This will create a data set named pct with a variable p99 containing the 99th percentile of score.
Then filter your table for values >= p99:
proc sql noprint;
create table want as
select a.*
from have as a
where a.score >= (select p99 from pct);
quit;
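The steps above compute one percentile for the whole table. Since the question asks for the top 1% within each drug and place, a possible variation (a sketch, assuming have contains the drug, place and score columns shown in the question; pct_by_group and want_by_group are hypothetical names) is to add a CLASS statement and join on the grouping columns:

```sas
/* One 99th percentile per drug/place combination */
proc summary data=have nway;
  class drug place;
  var score;
  output out=pct_by_group(drop=_type_ _freq_) p99=p99;
run;

/* Keep the rows at or above their own group's 99th percentile */
proc sql;
  create table want_by_group as
  select a.*
  from have as a
       inner join pct_by_group as b
       on a.drug = b.drug and a.place = b.place
  where a.score >= b.p99;
quit;
```

With very small groups, such as those in the sample data, the 99th percentile is effectively the group maximum, so the result is only meaningful at the larger scale the question mentions.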

How to delete all the duplicate observations but add a column with the frequency in SAS?

In a dataset in SAS, I have some observations multiple times. What I am trying to do is: I am trying to add a column with the frequency of each observation and make sure I keep it only one time in my dataset. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
@kl78 is right, proc summary is the best non-SQL solution here. This runs in memory, which can cause problems with very large datasets, but you should be OK with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
/* the & modifier lets address contain single blanks; two blanks end the value */
input name $ id address &$ age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;

Defining variables in one table based on values in another table

I'm fairly new with SAS and am looking for a little guidance.
I have two tables. One contains my data (something like the below, although much larger):
Data DataTable;
Input Var001 $ Var002 $;
Datalines;
000 050
063 052
015 017
997 035
;
run;
My variables are integers (read in as text) from 000 to 999. There can be as few as two, or as many as 500 depending on what the user is doing.
The second table contains user specified groupings of the variables in the DataTable:
Data Var_Groupings;
input var $ range $ Group_Desc $;
Datalines;
001 025 0-25
001 075 26-75
001 999 76-999
002 030 0-30
002 050 31-50
002 060 51-60
002 999 61-999
;
run;
(In actuality, this table is adjusted by the user in Excel and then imported, but this will work for the purposes of troubleshooting).
The "var" variable in the var_groupings table corresponds to a var column in the DataTable. So for instance a "var" of 001 in the var_groupings table is saying that this grouping will be on var001 of the DataTable.
The "Range" variable specifies the upper bound of a grouping. So looking at ranges in the var_groupings table where var is equal to 001, the user wants the first group to span from 0 to 25, the second group to span from 26 to 75, and the last group to span from 76 to 999.
EDIT: The Group_Desc column can contain any string and is not necessarily of the form presented here.
The final table should look something like this:
Var001 Var002 Var001_Group Var002_group
000 050 0-25 31-50
063 052 26-75 51-60
015 017 0-25 0-30
997 035 76-999 31-50
I'm not sure how I would even approach something like this. Any guidance you can give would be greatly appreciated.
That's an interesting one, thanks! It can be solved using CALL EXECUTE, since we need to create variable names from values. And obviously PROC FORMAT is the easiest way to convert some values into ranges. So, combining these two things we can do something like this:
proc sort data=Var_Groupings; by var range; run;
/*create dataset which will be the source of our formats' descriptions*/
data formatset;
set Var_Groupings;
by var;
fmtname='myformat';
type='n';
label=Group_Desc;
start=input(lag(range),8.)+1;
end=input(range,8.);
if FIRST.var then start=0;
drop range Group_Desc;
run;
/*put the raw data into new one, which we'll change to get what we want (just to avoid
changing the raw one)*/
data want;
set Datatable;
run;
/*now we iterate through all distinct variable numbers. As soon as we find a new number
we generate three steps with CALL EXECUTE: PROC FORMAT, a DATA step to apply this format
to a specific variable, and then PROC CATALOG to delete the format*/
data _null_;
set formatset;
by var;
if FIRST.var then do;
call execute(cats("proc format library=work cntlin=formatset(where=(var='",var,"')); run;"));
call execute("data want;");
call execute("set want;");
call execute(cats('_Var',var,'=input(var',var,',8.);'));
call execute(cats('Var',var,'_Group=put(_Var',var,',myformat.);'));
call execute("drop _:;");
call execute("proc catalog catalog=work.formats; delete myformat.format; run;");
end;
run;
UPDATE. I've changed the first DATA step (the one creating formatset) so that end and start for each range are now taken from the variable range, not Group_Desc. PROC SORT has also been moved to the beginning of the code.

SAS: Loop over dataset, make temp data step for ith row, do some proc w/ temp data, return results to first dataset

So I have a dataset_a that looks like this:
Name Month
Dick Aug
Dick Sep
Dick Oct
Jane Aug
Jane Sep
...
And some other, much larger dataset_b like this:
Name Day X Y
Dick 12-Jul-13 14.8 2.3
Jane 05-Sep-13 12.2 2.0
Dick 02-Aug-13 15.1 3.2
Dick 07-Aug-13 14.5 3.0
Jane 05-Aug-13 12.8 2.5
Dick 08-Aug-13 14.5 3.0
Dick 10-Aug-13 13.5 2.3
Jane 31-Jul-13 13.0 2.2
...
I want to iterate over it, and for each row in dataset_a, do a data step that gets the appropriate records from dataset_b and puts them in a temp dataset--temp, let's call it. Then I need to do a proc reg on temp and stick the results (row-vector-style) back into dataset_a, like so:
Name Month Parameter-est.-for-Y p-value R-squared
Dick Aug Some # Some # Some #
Dick Sep Some # Some # Some #
Dick Oct Some # Some # Some #
Jane Aug Some # Some # Some #
Jane Sep Some # Some # Some #
...
Here's some code/pseudocode to illustrate my need:
for each row in dataset_a
data temp;
set dataset_b; where name=['i'th name] and month(day)=['i'th month];
run;
proc reg /*noprint*/ alpha=0.1 outest=[?] tableout; model X = Y; run;
/*somehow put these regression results back into 'i'th row of dataset_a*/
next
Please post a comment if something doesn't make sense. Thanks very much in advance!
The efficient approach for this is somewhat different than what you are listing. In the particular instance you show, the most efficient approach would be to use a format to group the Day values into Months, and run your regression by name day, assuming regression respects formats (if not, then create a new variable month and assign that using the format).
For example:
data for_reg/view=for_reg;
set dataset_b;
month=put(day,MONNAME3.);
run;
Or
proc datasets lib=work;
modify dataset_b;
format day MONNAME3.;
quit;
Then
proc reg data=for_reg;
by name month; *or if using the other one, by name day;
**other proc reg statements**;
run;
Then merge that output dataset with dataset_a if needed. It will run the proc reg as if you'd run it once for each name/month combination, but all in one call and one pass through the data.
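To make the one-pass approach concrete, here is a hedged sketch. It assumes day is a numeric SAS date and uses the model X = Y from the question's pseudocode; reg_input and reg_est are hypothetical dataset names, and the EDF option is included only so that R-squared is written to the OUTEST= dataset.

```sas
/* Derive a three-letter month so it can serve as a BY variable */
data reg_input;
  set dataset_b;
  length month $3;
  month = put(day, monname3.);
run;

/* BY-group processing expects the data to be ordered by the BY variables */
proc sort data=reg_input;
  by name month;
run;

/* One regression per name/month combination in a single call */
proc reg data=reg_input noprint outest=reg_est tableout edf alpha=0.1;
  by name month;
  model X = Y;
run;
quit;
```

In reg_est, the _TYPE_ column identifies each row (for example PARMS for the estimates and PVALUE for the p-values) and _RSQ_ holds R-squared, so the rows you need can be picked out before merging back to dataset_a by name and month.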
If PROC REG doesn't respect by groups (and I think it does, but who knows), the best solution is still to do something like this; write a macro to run the proc reg taking arguments of name and month, and call the macro from the dataset_a. Then generate common output files (or proc append them into a single master output dataset in the macro) and merge the result to dataset_a if needed at the end.
Something like
%macro run_procreg(name=,month=);
data for_run/view=for_run;
set dataset_b;
where name="&name." and put(day,MONNAME3.)="&month.";
run;
proc reg data=for_run;
*other stuff*;
output out=tempdataset; *or however you create your output;
run;
proc append base=master_output data=tempdataset force;
run;
%mend run_procreg;
proc sql;
select cats('%run_procreg(name=',name,',month=',month,')') into :macrocalllist
separated by ' ' from dataset_a;
quit;
&macrocalllist;
data fin;
merge dataset_a (in=a) master_output(in=b);
by name month;
run;
You probably don't need to merge on dataset_a at the end if it just has those two variables. This will be a lot slower than one call with by, but if it's necessary, this is the way to do it.
You can also use CALL EXECUTE in the data step to drive a list of macro calls like the one above. That is conceptually the closest match to your pseudocode, but it doesn't return the information to the data step (the generated code executes after the data step completes), and it's slightly more troublesome than the above method. There is also, in 9.3+, the DOSUBL function, which lets you do something closer to what you want, but I don't know it well enough to explain it or to know that it does indeed meet your needs.
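A minimal sketch of the CALL EXECUTE variant, assuming the %run_procreg macro defined above has already been compiled and that it quotes its name and month arguments where it compares them to character values:

```sas
/* One macro call is queued per row of dataset_a. %NRSTR delays macro
   execution until the data step has finished, so the steps generated
   by the macro run in order afterwards. */
data _null_;
  set dataset_a;
  call execute(cats('%nrstr(%run_procreg)(name=', name, ',month=', month, ')'));
run;
```

Without %NRSTR the macro would start executing while the calls are still being queued, which can give surprising results if the macro sets macro variables inside its own data steps.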