Divide a dataset into subsets based on a column and perform a repeated operation for subsets

Divide a dataset into subsets based on a column and perform a repeated operation for subsets - sas

I need to perform the same operation on many different periods. In my sample data for two periods: 402 and 403.
I cannot understand the concept of how I can make a loop that will do it for me.
At the end, I'd like to have final1 for period 402, final2 for period 403 etc.
Sample data that I use for testing:
data one;
input period $ a $ b $ c $ d e;
cards;
402 a . a 1 3
402 . b . 2 4
402 a a a . 5
402 . . b 3 5
403 a a a . 6
403 a a a . 7
403 a a a 2 8
;
run;
This is how I manually choose one period of one data:
data new;
set one;
where period='402';
run;
This is how I calculate different things for the given period e.g. number of missing data, non-missing, total:
1 - For numeric variables:
proc iml;
use new;
read all var _NUM_ into x[colname=nNames];
n = countn(x,"col");
nmiss = countmiss(x,"col");
ntotal = n + nmiss;
2 - and similarly for char variables:
read all var _CHAR_ into x[colname=cNames];
close nww;
c = countn(x,"col");
cmiss = countmiss(x,"col");
ctotal = c + cmiss;
Save numeric and char results:
create cnt1Data var {nNames n nmiss ntotal};
append;
close cnt1Data;
create cnt2Data var {cNames c cmiss ctotal};
append;
close cnt2Data;
Rename columns to be the same:
data cnt1Datatemp;
set cnt1Data;
rename nNames = Name n = nonMissing nmiss = missing ntotal = total;
run;
data cnt2Datatemp;
set cnt2Data;
rename cNames = Name c = nonMissing cmiss = missing ctotal = total;
run;
and merge data into the final set:
data final;
set cnt1Datatemp cnt2Datatemp;
run;
Final data for period 402 should look like:
a b c d e
2 2 1 1 0 - missing
2 2 3 3 4 - non-missing
4 4 4 4 4 - total
and respectively for period 403:
a b c d e
0 0 0 2 0 - missing
3 3 3 1 3 - non-missing
3 3 3 3 3 - total

You can make something similar with simple SQL query.
create table miss_count as select period
, sum(missing(A)) as A
, sum(missing(B)) as B
...
from have
group by period
;
Results:
period a b c d e
402 2 2 1 1 0
403 0 0 0 2 0
It you add in
, count(*) as nobs
then you have all the information you need to calculate all of the counts you wanted.
If the number of variables is short enough you can even generate the code into a macro variable (limit of 64K bytes in a macro variable)
proc sql noprint;
select catx(' ','sum(missing(',nliteral(name),')) as',nliteral(name))
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='ONE' and lowcase(name) ne 'period'
;
create table miss_count as select period,count(*) as nobs,&varlist
from one
group by period
;
quit;
Results:
period nobs a b c d e
402 4 2 2 1 1 0
403 3 0 0 0 2 0

It is much easier to find this information in sql;
proc sql;
select sum(a is not missing) as fil_a
, sum(a is missing) as mis_a
, count(*) as tot_a
from one
where period eq 402;
quit;
You can even 0handle all periods at once using group by.
There are a few ways to make this work for all variables in a dataset (except for some group by variables). For instance:
%macro count_missing();
proc sql;
select count(*), name
into :no_var, :var_list separated by ' '
from sasHelp.vcolumn
where libName eq 'WORK' and memName eq 'ONE' and upcase(name) ne 'PERIOD';
create view count_missing as
select count(*) as total
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
, sum(&var is missing) as mis_&var
%end;
from work.one
group by period;
quit;
data report_missing;
set count_missing;
format count_of $32.;
count_of = 'missing';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = mis_&var;
%end;
output;
count_of = 'non missing';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = total - mis_&var;
%end;
output;
count_of = 'total';
%do var_nr = 1 %to &no_var;
%let var = %scan(&var_list, &var_nr);
&var = total;
%end;
output;
end;
%mend;
%count_missing();

You don't need iml to summarize data over observations. You can do that with a retain statement too. Moreover, using by processing with first and last, you can process all periods in one go.
data final;
set one;
by period;
if first.period then do;
mis_a = 0;
total = 0;
end;
retain mis_a;
if missing(a) then mis_a +=1; else fil_a += 1;
total += 1;
if last.period;
fil_a = total - mis_a;
end;
This is by far the fastest way to handle a big dataset if the data is sorted by period.
To make it work for a set of variables not known upfront, you can apply the same techniques as in my other solution.

Related

Loop through SAS variables and create data sets

I have a SAS data set t3. I want to run a data step inside a loop through a set of variables to create additional sets based on the variable value = 1, and rank two variables bal and otheramt in each subset, and then merge the ranks for each subset onto the original data set. Each rank column needs to be dynamically named so I know what subset is getting ranked. I know how to do proc rank and macros basically but do not know how to do this in the most dynamic way inside of a macro. Can you assist?
ID
bal
otheramt
firstvar
secondvar
lastvar
444
581
100
1
1
555
255
200
1
1
1
666
255
300
--------------
1
--------------
%macro dog();
data new;
set t3;
ARRAY Indicators(5) FirstVar--LastVar;
/*create data set for each of the subsets if firstvar = 1, secondvar = 1 ... lastvar = 1 */
/*for each new data set, rank by bal and otheramt*/
/*name the new rank columns [FirstVar]BalRank, [FirstVar]OtherAmtRank; */
/*merge the new ranks onto the original data set by ID*/
%mend;
%dog()
The Proc rank section would be something like this, but I would need the rank columns to have information about what subset I am ranking.
proc rank data=subset1 out=subset1ranked;
var bal otheramt;
ranks bal_rank otheramt_rank;
run;

Instead of using macro, use data transformation and reshaping that allows simpler steps to be written.
Example:
Rows are split into multiple rows based on flag so group processing in RANK can occur. Two transposes are required to reshape the results back a single row per id.
data have;
call streaminit(20230216);
do id = 1 to 100;
foo = rand('integer', 50,150);
bar = rand('integer', 100,200);
flag1 = rand('integer', 0, 1);
flag2 = rand('integer', 0, 1);
flag3 = rand('integer', 0, 1);
output;
end;
run;
data step1;
set have;
/* important: the group value becomes part of the variable name later */
if flag1 then do; group='flag1_'; output; end;
if flag2 then do; group='flag2_'; output; end;
if flag3 then do; group='flag3_'; output; end;
drop flag:;
run;
proc sort data=step1;
by group;
run;
proc rank data=step1 out=step2;
by group;
var foo bar;
ranks foo_rank bar_rank;
run;
proc sort data=step2;
by id group;
run;
* pivot (reshape) so there is one row per ranked var;
proc transpose data=step2 out=step3(drop=_label_);
by id foo bar group;
var foo_rank bar_rank;
run;
* pivot again so there is one row per id;
proc transpose data=step3 out=step4(drop=_name_);
by id;
var col1;
id group _name_;
run;
* merge so those 0 0 0 flag rows remain intact;
data want;
merge have step4;
by id;
run;

Since we don't have much sample data, I created test data from sashelp.class with some indicator variables like yours.
data have;
set sashelp.class;
firstvar=round(rand('uniform',1));
secondvar=round(rand('uniform',1));
thirdvar=round(rand('uniform',1));
drop sex weight;
run;
Partial output:
Name Age Height firstvar secondvar thirdvar
Alfred 14 69 1 0 1
Alice 13 56.5 0 1 1
Barbara 13 65.3 1 0 0
Carol 14 62.8 0 0 0
To dynamically rank data based on indicator variables, I created a macro that accepts a list of indicators and rank variables. The 2 lists help to create the specific variable names you requested. Here's the macro call:
%rank(indicators=firstvar secondvar thirdvar,
rank_vars=age height);
Here's part of the final output. Notice the indicators in the sample output above coincide with the ranks in this output. Also note that Carol is not in the output because she had no indicators set to 1.
Name Age Height firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank thirdvar_height_rank
Alfred 14 69 8 11 . . 6.5 10
Alice 13 56.5 . . 3.5 2 4.5 2
Barbara 13 65.3 6.5 8 . . . .
Henry 14 63.5 . . 5.5 5 . .
The full macro is listed below. It has 3 parts.
Create a temp data set with a group variable that contains the number of the indicator variable based on the order of the variable in the list. Whenever an indicator = 1 the obs is output. If an obs has all 3 indicators set to 1 then it will be output 3 times with the group variable set to the number of each indicator variable. This step is important because proc rank will rank groups independently.
Generate the rankings on the temp data set. Each group will be ranked independently of the other groups and can be done in one step.
Construct the final data set by essentially transposing the ranked data into columns.
%macro rank(indicators=, rank_vars=);
%let cnt_ind = %sysfunc(countw(&indicators));
%let cnt_vars = %sysfunc(countw(&rank_vars));
data temp;
set have;
array indicators(*) &indicators;
do i = 1 to dim(indicators);
if indicators(i) = 1 then do;
group = i; * create a group based on order of indicators;
output; * an obs can be output multiple times;
end;
end;
drop i &indicators;
run;
proc sort data=temp;
by group;
run;
* Generate rankings by group;
proc rank data=temp out=ranks;
by group;
var &rank_vars;
ranks
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
proc sort data=ranks;
by name group;
run;
* Contruct final data set by transposing the ranks into columns;
data want;
set ranks;
by name;
* retain statement to declare new variables and retain values;
retain
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let vars = &vars &ivar._&jvar._rank;
%end;
%end;
&vars;
if first.name then call missing (of &vars);
* option 1: build series of IF statements;
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%str(if group = &i then do;)
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let newvar = &ivar._&jvar._rank;
%str(&newvar = &jvar._rank;)
%end;
%str(end;)
%end;
if last.name then output;
drop group
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
%mend;
When constructing the final data set and transposing the rank variables, there are a couple of options. The first option shown above is to dynamically build a series of if statements. Here is what the code generates:
MPRINT(RANK): * option 1: build series of IF statements;
MPRINT(RANK): if group = 1 then do;
MPRINT(RANK): firstvar_age_rank = age_rank;
MPRINT(RANK): firstvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 2 then do;
MPRINT(RANK): secondvar_age_rank = age_rank;
MPRINT(RANK): secondvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 3 then do;
MPRINT(RANK): thirdvar_age_rank = age_rank;
MPRINT(RANK): thirdvar_height_rank = height_rank;
MPRINT(RANK): end;
The 2nd option is to use an array and mathematically calculate the index into the array by the group number and variable number. Here is the snippet of macro code to replace the if series code:
* option 2: create arrays and calculate index into array
* by group number and variable number;
array ranks(*) &vars;
array rankvars(*)
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
%str(idx = dim(rankvars) * (group - 1);)
%str(do i = 1 to dim(rankvars);)
%str(ranks(idx + i) = rankvars(i);)
%str(end;)
Here is the generated code:
MPRINT(RANK): * option 2: create arrays and calculate index into array * by group number and variable number;
MPRINT(RANK): array ranks(*) firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank
thirdvar_height_rank;
MPRINT(RANK): array rankvars(*) age_rank height_rank;
MPRINT(RANK): idx = dim(rankvars) * (group - 1);
MPRINT(RANK): do i = 1 to dim(rankvars);
MPRINT(RANK): ranks(idx + i) = rankvars(i);
MPRINT(RANK): end;
It takes a minute to understand the array option, but once you do, it is preferable over generating if statments. As the number of variables increases, the code generated by the array option is the same and operates more efficiently.

Add new empty rows to a SAS table with names from another table

Assume I have table foo which contains a (dynamic) list of new rows which I want to add to another table have, so that it yields a table want looking e.g. like this:
x y p_14 p_15
1 2 2 99
2 4 7 24
Example data for foo:
id row_name
14 p_14
15 p_15
Example data for have:
x y p Z
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
What I have so far is the following which is not yet in macro shape:
proc sql;
create table want as
select old.*, t1.p_14, t2.p_15 /* choosing non-duplicate rows */
from (select x, y from have) old
left join (select x, y, z as p_14 from have where p=14) t1
on old.x=t1.x and old.y=t1.y
left join (select x, y, z as p_15 from have where p=15) t2
on old.x=t2.x and old.y=t2.y
;
quit;
Ideally, I am aiming for a macro where which takes foo as input and automatically creates all the joins from above. Also, the solution should not spit out any warnings in the console. My challenge is how to dynamically choose the correct (non-duplicate) rows.
PS: This is a follow-up question of Populate SAS macro-variable using a SQL statement within another SQL statement? The important bit is that it is not a full transpose, I guess.

You can go from HAVE to WANT with PROC TRANSPOSE.
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
by x y ;
id p ;
var z;
run;
To limit it to the values of P that occur in FOO you could use a macro variable (as long as the number of observations in FOO is small enough).
proc sql noprint ;
select id into :idlist separated by ' ' from foo ;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
If the issue is you want variable P_17 to be in the result even if 17 does not appear in HAVE then add a little more complexity. For example add another data step that will force the creation of the empty variables. You can generate the list of variable names from the list of id's in FOO.
proc sql noprint ;
select id , cats('p_',id)
into :idlist separated by ' '
, :varlist separated by ' '
from foo
;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
data want ;
set want (keep=x y);
array all &varlist ;
set want ;
run;
Results:
Obs x y p_14 p_15 p_17
1 1 2 2 99 .
2 2 4 7 24 .
If the number of values is too large to store in a single macro variable (limit 64K bytes) you could generate the WHERE statement with a data step to a file and use %INCLUDE to add the WHERE statement into the code.
filename where temp;
data _null_;
set foo end=eof;
file where ;
if _n_=1 then put 'where p in (' #;
put id # ;
if eof then put ');' ;
run;
proc transpose ... ;
%include where / source2;
...

Use macro program:
data have;
input x y p Z;
cards;
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
;
data foo;
input id row_name $;
cards;
14 p_14
15 p_15
;
%macro test(dsn);
proc sql;
select count(*) into:n trimmed from &dsn;
select id into: value separated by ' ' from &dsn;
create table want as
select distinct a.x,a.y,
%do i=1 %to &n;
%let cur=%scan(&value,&i);
t&i..p_&cur
%if &i<&n %then ,;
%else ;
%end;
from have a
%do i=1 %to &n;
%let cur=%scan(&value,&i);
left join have (where=(p=&cur) rename=(z=p_&cur.)) t&i.
on a.x=t&i..x and a.y=t&i..y
%end;
;
quit;
%mend;
%test(foo);

How to recode values of a variable based on the maxmium value in the variable, for hundreds of variables?

I want to recode the max value of a variable as 1 and 0 when it is not. For each variable, there may be multiple observations with the max value. The max value for each value is not fixed, i.e. from cycle to cycle the max value for each variable may change. And there are hundreds of variables, cannot "hard-code" anything.
The final product would have the same dimensions as the original table, i.e. equal number of rows and columns as a matrix of 0s and 1s.
This is within SAS. I attempted to calculate the max of each variable and then append these max as a new observation into the data. Then comparing down the column of each variable against the "max" observation... looking into examples of the following did not help:
SQL
Array in datastep
proc transpose
formatting
Any insight would be much appreciated.

Here is a version done with SQL:
The idea is that we first calculate the maximum. The Latter select. Then we join the data to original and the outer the case-select specifies if the flag is set up or not.
data begin;
input var value;
cards;
1 1
1 2
1 3
1 2.5
1 1.7
1 3
2 34
2 33
2 33
2 33.7
2 34
2 34
; run;
proc sql;
create table result as
select a.var, a.value, case when a.value = b.maximum then 1 else 0 end as is_max from
(select * from begin) a
left join
(select max(value) as maximum, var from begin group by var) b
on a.var = b.var
;
quit;

To avoid "hard-code" you need to use some code generation.
First let's figure out what code you could use to solve the problem. Later we can look into ways to generate that code.
It is probably easiest to do this with PROC SQL code. SAS will allow you to reference the MAX() value of a variable. Also note that SAS evaluates boolean expressions to 1 (TRUE) or 0 (FALSE). So you just want to generate code like:
proc sql;
create table want as
select var1=max(var1) as var1
, var2=max(var2) as var2
from have
;
quit;
To generate the code you need a list of the variables in your source dataset. You can get those with PROC CONTENTS but also with the metadata table (view) DICTIONARY.COLUMNS (also accessible as SASHELP.VCOLUMN from outside PROC SQL).
If the list of variables is small then you could generate the code into a single macro variable.
proc sql noprint;
select catx(' ',cats(name,'=max(',name,')'),'as',name)
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='HAVE'
order by varnum
;
create table want as
select &varlist
from have
;
quit;
The maximum number of characters that will fit into a macro variable is 64K. So long enough for about 2,000 variables with names of 8 characters each.
Here is little more complex way that uses PROC SUMMARY and a data step with a temporary array. It does not really need any code generation.
%let dsin=sashelp.class(obs=10);
%let dsout=want;
%let varlist=_numeric_;
proc summary data=&dsin nway ;
var &varlist;
output out=summary(drop=_type_ _freq_) max= ;
run;
data &dsout;
if 0 then set &dsin;
array vars &varlist;
array max [10000] _temporary_;
if _n_=1 then do;
set summary ;
do _n_=1 to dim(vars);
max[_n_]=vars[_n_];
end;
end;
set &dsin;
do _n_=1 to dim(vars);
vars[_n_]=vars[_n_]=max[_n_];
end;
run;
Results:
Obs Name Sex Age Height Weight
1 Alfred M 0 1 1
2 Alice F 0 0 0
3 Barbara F 0 0 0
4 Carol F 0 0 0
5 Henry M 0 0 0
6 James M 0 0 0
7 Jane F 0 0 0
8 Janet F 1 0 1
9 Jeffrey M 0 0 0
10 John M 0 0 0

Using SAS, is it possible to get a frequency table where no data exist?

This is a follow-up to my previous post on SO.
I am trying to produce a frequency table of demographics, including race, sex, and ethnicity. One table is a crosstab of race by sex for Hispanic participants in a study. However, there are no Hispanic participants thus far. So, the table will be all zeroes, but we still have to report it.
This can be done in R, but so far, I have found no solution for SAS. Example data is below.
data race;
input race eth sex ;
cards;
1 2 1
1 2 1
1 2 2
2 2 1
2 2 2
2 2 1
3 2 2
3 2 2
3 2 1
4 2 2
4 2 1
4 2 2
run;
data class;
do race = 1,2,3,4,5,6,7;
do eth = 1,2,3;
do sex = 1,2;
output;
end;
end;
end;
run;
proc format;
value frace 1 = "American Indian / AK Native"
2 = "Asian"
3 = "Black or African American"
4 = "Native Hawiian or Other PI"
5 = "White"
6 = "More than one race"
7 = "Unknown or not reported" ;
value feth 1 = "Hispanic or Latino"
2 = "Not Hispanic or Latino"
3 = "Unknown or Not reported" ;
value fsex 1 = "Male"
2 = "Female" ;
run;
***** ethnicity by sex ;
proc tabulate data = race missing classdata=class ;
class race eth sex ;
table eth, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
***** race by sex ;
proc tabulate data = race missing classdata=class ;
class race eth sex ;
table race, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
***** race by sex, for Hispanic only ;
***** log indicates that a logical page with only missing values has been deleted ;
***** Thanks SAS, you're a big help... ;
proc tabulate data = race missing classdata=class ;
where eth = 1 ;
class race eth sex ;
table race, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
I understand that the code really can't work because I'm selecting where eth is equal to 1 (there are no cases satisfying the condition...). Specifying the command to be run by eth doesn't work either.
Any guidance is greatly appreciated...

I think the easiest way is to create a row in the data that has the missing value. You could look at the following paper for suggestions as to how to do this on a larger scale:
http://www.nesug.org/Proceedings/nesug11/pf/pf02.pdf
PROC FREQ has the SPARSE option, which gives you all possible combinations of all variables in the table (including missing ones), but it doesn't look like that gives you exactly what you need.

Looks like our good friends at Westat have worked with this issue. A description of there solution is shown here.
The code is shown below for convenience, but please cite the original when referenced
PROC FORMAT;
value ethnicf
1 = 'Hispanic or Latino'
2 = 'Not Hispanic or Latino'
3 = 'Unknown (Individuals Not Reporting Ethnicity)';
value racef
1 = 'American Indian or Alaska Native'
2 = 'Asian'
3 = 'Native Hawaiian or Other Pacific Islander'
4 = 'Black or African American'
5 = 'White'
6 = 'More Than One Race'
7 = 'Unknown or Not Reported';
value gndrf
1 = 'Male'
2 = 'Female'
3 = 'Unknown or Not Reported';
RUN;
DATA shelldata;
format ethlbl ethnicf. racelbl racef. gender gndrf.;
do ethcat = 1 to 2;
do ethlbl = 1 to 3;
do racelbl = 1 to 7;
do gender = 1 to 3;
output;
end;
end;
end;
end;
RUN;
DATA test;
input pt $ 1-3 ethlbl gender racelbl ;
cards;
x1 2 1 5
x2 2 1 5
x3 2 1 5
x4 2 1 5
x5 2 1 5
x6 2 2 2
x7 2 2 2
x8 2 2 5
x9 2 2 4
x10 2 2 4
RUN;
DATA enroll;
set test;
if ethlbl = 1 then ethcat = 1;
else ethcat = 2;
format ethlbl ethnicf. racelbl racef. gender gndrf.;
label ethlbl = 'Ethnic Category'
racelbl = 'Racial Categories'
gender = 'Sex/Gender';
RUN;
%MACRO TAB_WHERE;
/* PROC SQL step creates a macro variable whose */
/* value will be the number of observations */
/* meeting WHERE clause criteria. */
PROC SQL noprint;
select count(*)
into :numobs
from enroll
where ethcat=1;
QUIT;
/* PROC FORMAT step to display all numeric values as zero. */
PROC FORMAT;
value allzero low-high=' 0';
RUN;
/* Conditionally execute steps when no observations met criteria. */
%if &numobs=0 %then
%do;
%let fmt = allzero.; /* Print all cell values as zeroes */
%let str = ; /*No Cases in Subset - WHERE cannot be used */
%end;
%else
%do;
%let fmt = 8.0;
%let str = where ethcat = 1;
%end;
PROC TABULATE data=enroll classdata=shelldata missing format=&fmt;
&str;
format racelbl racef. gender gndrf.;
class racelbl gender;
classlev racelbl gender;
keyword n pctn all;
tables (racelbl all='Racial Categories: Total of Hispanic or Latinos'),
gender='Sex/Gender'*N=' ' all='Total'*n='' / printmiss misstext='0'
box=[LABEL=' '];
title1 font=arial color=darkblue h=1.5 'Inclusion Enrollment Report';
title2 ' ';
title3 font=arial color=darkblue h=1' PART B. HISPANIC ENROLLMENT REPORT:
Number of Hispanic or Latinos Enrolled to Date (Cumulative)';
RUN;
%MEND TAB_WHERE;
%TAB_WHERE

I found this paper to be very informative:
Oh No, a Zero Row: 5 Ways to Summarize Absolutely Nothing
The preloadfmt option in proc means (Method 5) is my favorite. Once you create the necessary formats it's not necessary to add dummy data. It's odd that they haven't yet added this option to proc freq.

Split SAS dataset

I have a SAS dataset that looks like this:
id | dept | ...
1 A
2 A
3 A
4 A
5 A
6 A
7 A
8 A
9 B
10 B
11 B
12 B
13 B
Each observation represents a person.
I would like to split the dataset into "team" datasets, each dataset can have a maximum of 3 observations.
For the example above this would mean creating 3 datasets for dept A (2 of these datasets would contain 3 observations and the third dataset would contain 2 observations). And 2 datasets for dept B (1 containing 3 observations and the other containing 2 observations).
Like so:
First dataset (deptA1):
id | dept | ...
1 A
2 A
3 A
Second dataset (deptA2)
id | dept | ...
4 A
5 A
6 A
Third dataset (deptA3)
id | dept | ...
7 A
8 A
Fourth dataset (deptB1)
id | dept | ...
9 B
10 B
11 B
Fifth dataset (deptB2)
id | dept | ...
12 B
13 B
The full dataset I'm using contains thousands of observations with over 50 depts. I can work out how many datasets per dept are required and I think a macro is the best way to go as the number of datasets required is dynamic. But I can't figure out the logic to create the datasets so that they have have a maximum of 3 observations. Any help appreciated.

Another version.
Compared to DavB version, it only processes input data once and splits it into several tables in single datastep.
Also if more complex splitting rule is required, it can be implemented in datastep view WORK.SOURCE_PREP.
data WORK.SOURCE;
infile cards;
length ID 8 dept $1;
input ID dept;
cards;
1 A
2 A
3 A
4 A
5 A
6 A
7 A
8 A
9 B
10 B
11 B
12 B
13 B
14 C
15 C
16 C
17 C
18 C
19 C
20 C
;
run;
proc sort data=WORK.SOURCE;
by dept ID;
run;
data WORK.SOURCE_PREP / view=WORK.SOURCE_PREP;
set WORK.SOURCE;
by dept;
length table_name $32;
if first.dept then do;
count = 1;
table = 1;
end;
else count + 1;
if count > 3 then do;
count = 1;
table + 1;
end;
/* variable TABLE_NAME to hold table name */
TABLE_NAME = catt('WORK.', dept, put(table, 3. -L));
run;
/* prepare list of tables */
proc sql noprint;
create table table_list as
select distinct TABLE_NAME from WORK.SOURCE_PREP where not missing(table_name)
;
%let table_cnt=&sqlobs;
select table_name into :table_list separated by ' ' from table_list;
select table_name into :tab1 - :tab&table_cnt from table_list;
quit;
%put &table_list;
%macro loop_when(cnt, var);
%do i=1 %to &cnt;
when ("&&&var.&i") output &&&var.&i;
%end;
%mend;
data &table_list;
set WORK.SOURCE_PREP;
select (TABLE_NAME);
/* generate OUTPUT statements */
%loop_when(&table_cnt, tab)
end;
run;

You could try this:
%macro split(inds=,maxobs=);
proc sql noprint;
select distinct dept into :dept1-:dept9999
from &inds.
order by dept;
select ceil(count(*)/&maxobs.) into :numds1-:numds9999
from &inds.
group by dept
order by dept;
quit;
%let numdept=&sqlobs;
data %do i=1 %to &numdept.;
%do j=1 %to &&numds&i;
dept&&dept&i&&j.
%end;
%end;;
set &inds.;
by dept;
if first.dept then counter=0;
counter+1;
%do i=1 %to &numdept.;
%if &i.=1 %then %do;
if
%end;
%else %do;
else if
%end;
dept="&&dept&i" then do;
%do k=1 %to &&numds&i.;
%if &k.=1 %then %do;
if
%end;
%else %do;
else if
%end;
counter<=&maxobs.*&k. then output dept&&dept&i&&k.;
%end;
end;
%end;
run;
%mend split;
%split(inds=YOUR_DATASET,maxobs=3);
Just replace the INDS parameter value in the %SPLIT macro call to the name of your input data set.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Divide a dataset into subsets based on a column and perform a repeated operation for subsets - sas

Related

Loop through SAS variables and create data sets

Add new empty rows to a SAS table with names from another table

How to recode values of a variable based on the maxmium value in the variable, for hundreds of variables?

Using SAS, is it possible to get a frequency table where no data exist?

Split SAS dataset

Categories

Resources