SAS: pivot multiple columns to long format

I have a dataset relating to pregnancy outcomes, where the outcomes for each baby are in wide format.
So, I have the columns:
Patient_ID *for the mother;
pofid_1
pof1pregenddate
pof1pregendweeks
pofid_2
pof2pregenddate
pof2pregendweeks
etc, etc.
pofid_1 is a unique identifier for each baby, and is the only variable that doesn't follow the format pof<n><varname> (pof = pregnancy outcome form). There are ~50 columns for each baby; I have listed only three here for demonstration. Is there a way to pivot the whole dataset based on the number after pof, so that I have the following column names and one row for each baby born:
Patient_ID
babynumber
pofid *baby ID;
pofpregenddate
pofpregendweeks

You can pivot sets of grouped variables by using DATA step arrays. The naming convention you are dealing with is unfortunately awkward for this, because the index number sits in the middle of the variable name. Some pre-processing can create a rename statement that moves the index number to the end of the variable name, after which the array processing becomes very straightforward.
Example:
This example hard codes the grouped variables in array statements of the final step. Programmatic detection of variables that should be grouped (by common # in their name) is possible but more complicated.
data have(keep=id child:);
do id = 1 to 10;
z+1; childid1 = z;
z+1; child1metricX = z;
z+1; child1metricY = z;
z+1; childid2 = z;
z+1; child2metricX = z;
z+1; child2metricY = z;
z+1; childid3 = z;
z+1; child3metricX = z;
z+1; child3metricY = z;
output;
end;
run;
proc contents noprint data=have out=havevars;
run;
proc sql;
create table newnames as
select name, prxchange('s/child(\d+)([^ ]+)\s*/child$2_$1/i',-1,name) as newname
from havevars
where upcase(name) like 'CHILD%'
;
proc sql noprint;
* create rename option;
select
catx('=',name,newname)
, newname
into
:renames separated by ' '
, :drops separated by ' '
from newnames
;
quit;
data want;
set have (rename=(&renames));
array childids childid:;
array metricXs childmetricX:;
array metricYs childmetricY:;
do over childids;
childnum = _i_; * do over tacitly creates _i_;
childid = childids; * automatic implicit array index is _i_;
metricX = metricXs;
metricY = metricYs;
output;
end;
drop &drops;
run;
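Applied to the names in the question, the same rename-then-array idea might look like the sketch below. It assumes the wide data set is WORK.HAVE and spells out only the three example variables; the array names (ids, enddates, endweeks) are made up for illustration, and the remaining ~50 variables would each need their own array.

```sas
/* Sketch only: WORK.HAVE and the array names are assumptions. */
proc sql noprint;
  /* pof1pregenddate -> pofpregenddate_1; pofid_1 already ends in its index */
  select catx('=', name,
              prxchange('s/^pof(\d+)(\w+?)\s*$/pof$2_$1/i', 1, name))
    into :renames separated by ' '
  from dictionary.columns
  where libname = 'WORK' and memname = 'HAVE'
    and prxmatch('/^pof\d/i', name);
quit;

data want;
  set have(rename=(&renames));
  array ids      pofid_:;
  array enddates pofpregenddate_:;
  array endweeks pofpregendweeks_:;
  do babynumber = 1 to dim(ids);
    pofid           = ids[babynumber];
    pofpregenddate  = enddates[babynumber];
    pofpregendweeks = endweeks[babynumber];
    output;   /* one row per baby slot, including empty slots */
  end;
  drop pofid_: pofpregenddate_: pofpregendweeks_:;
run;
```

The new pofpregenddate variable does not inherit a date format from the array elements, so a FORMAT statement would probably be needed for it.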

It might be easiest to transpose ALL of them first. Then you could parse out the baby number from the name of the variable.
proc transpose data=have out=tall ;
by patientid;
var pof: ;
run;
data tall2;
set tall ;
_name_=upcase(_name_);
if _name_=:'POFID_' then do;
babynumber=input(scan(_name_,2,'_'),32.);
_name_='POFID';
end;
else do;
_name_=substr(_name_,4);
loc=verify(_name_,'0123456789');
babynumber=input(substr(_name_,1,loc-1),32.);
_name_=substr(_name_,loc);
end;
drop loc;
run;
If you want, you could just leave it in this TALL format. Or you could sort it and transpose it back into the semi-wide format.
proc sort data=tall2;
by patientid babynumber;
run;
proc transpose data=tall2 out=want;
by patientid babynumber;
id _name_;
var col1;
run;

Related

SAS Proc Report banded rows with skipped line

I am using PROC REPORT to generate an output. I need banded rows of alternating colours, which I achieve by incrementing a counter variable and testing whether the row number is odd or even; this works as expected. I am also using a compute block to add a blank line after each group of the order variable. I would like the background colour of that blank line to be determined by the value of the counter variable as well, but this doesn't seem to be possible. I do not want to go down the route of adding the blank line to the dataset before running PROC REPORT. Is there a solution? Please find the code below:
PROC REPORT DATA = sashelp.class NOWD SPLIT = "!" HEADLINE HEADSKIP MISSING ;
COLUMN sex name ;
DEFINE sex / ORDER ;
***this adds banding to the rows and works as expected ***;
COMPUTE name;
count+1;
IF MOD(count, 2) gt 0 THEN DO;
CALL DEFINE(_ROW_,'STYLE','style=[background=red]');
END;
ELSE DO;
CALL DEFINE(_ROW_,'STYLE','style=[background=green]');
END;
ENDCOMP;
***this section adds a blank line and I can control the background colour, but I cannot assign this colour based on the value of the count variable ***;
COMPUTE AFTER sex / style=[background=blue] ;
LINE " " ;
ENDCOMP;
RUN;
There is always the old way:
proc sort data = sashelp.class out = test;
by sex;
run;
data test;
set test;
by sex;
output;
if last.sex then do;
call missing(name);
output;
end;
run;
proc report data = test;
column sex name;
define sex /order order = data;
compute name;
count + 1;
if mod(count, 2) then do;
call define(_row_,'style','style=[background=green]');
end;
else do;
call define(_row_,'style','style=[background=red]');
end;
endcomp;
run;
If anyone can solve it just by changing an option, please share how.

SAS: Replace rare levels in variable with new level "Other"

I've got a pretty big table where I want to replace rare values (for this example, values with fewer than 10 occurrences, but the real case is more complicated: a variable might have 1000 levels while I want only 15). The list of possible levels might change, so I don't want to hardcode anything.
My code is like:
%let var = Make;
proc sql;
create table stage1_ as
select &var.,
count(*) as count
from sashelp.cars
group by &var.
having count >= 10
order by count desc
;
quit;
/* Join table with table including only top obs to replace rare
values with "other" category */
proc sql;
create table stage2_ as
select t1.*,
case when t2.&var. is missing then "Other_&var." else t1.&var. end as &var._new
from sashelp.cars t1 left join
stage1_ t2 on t1.&var. = t2.&var.
;
quit;
/* Drop old variable and rename the new as old */
data result;
set stage2_(drop= &var.);
rename &var._new=&var.;
run;
It works, but unfortunately it is not very efficient, as it needs a join for each variable (in the real case I am doing it in a loop).
Is there a better way to do it? Maybe some smart replace function?
Thanks!!
You probably don't want to change the actual data values. Instead consider creating a custom format for each variable that will map the rare values to an 'Other' category.
The ODS output of the FREQ procedure (the OneWayFreqs table) can capture the counts and percentages of every listed variable in a single table. NOTE: the TABLE statement's OUT= option captures only the last variable listed. Those counts can then be used to construct each format according to the 'othering' rules you want to implement.
data have;
do row = 1 to 1000;
array x x1-x10;
do over x;
if row < 600
then x = ceil(100*ranuni(123));
else x = ceil(150*ranuni(123));
end;
output;
end;
run;
ods output onewayfreqs=counts;
proc freq data=have ;
table x1-x10;
run;
data count_stack;
length name $32;
set counts;
array x x1-x10;
do over x;
name = vname(x);
value = x;
if value then output;
end;
keep name value frequency;
run;
proc sort data=count_stack;
by name descending frequency ;
run;
data cntlin;
do _n_ = 1 by 1 until (last.name);
set count_stack;
by name;
length fmtname $32;
fmtname = trim(name)||'top';
start = value;
label = cats(value);
if _n_ < 11 then output;
end;
hlo = 'O';
label = 'Other';
output;
run;
proc format cntlin=cntlin;
run;
ods html;
proc freq data=have;
table x1-x10;
format
x1 x1top.
x2 x2top.
x3 x3top.
x4 x4top.
x5 x5top.
x6 x6top.
x7 x7top.
x8 x8top.
x9 x9top.
x10 x10top.
;
run;
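Since the question asks to avoid hardcoding, the variable/format pairs in the FORMAT statement could also be generated rather than typed. A minimal sketch, assuming the count_stack table and the <name>top formats created above already exist (the macro variable name fmtpairs is made up for illustration):

```sas
/* Sketch: build "x1 x1top. x2 x2top. ..." from the names in COUNT_STACK. */
proc sql noprint;
  select distinct catx(' ', name, cats(name, 'top.'))
    into :fmtpairs separated by ' '
  from count_stack;
quit;

proc freq data=have;
  table x1-x10;
  format &fmtpairs;
run;
```

Format names may not end in a digit, which is why a suffix such as top is appended to each variable name.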

SAS Array <array-elements> to jump by 10

I want to achieve the same output, but instead of hardcoding each array element I would like to use something like var1 - var10, with the suffix jumping by 10 (by decade).
data work.test(keep= statename pop_diff:);
set sashelp.us_data(keep=STATENAME POPULATION:);
array population_array {*} POPULATION_1910 -- POPULATION_2010;
dimp = dim(population_array);
/* here and below something like:
array pop_diff_amount {10} pop_diff_amount_1920 -- pop_diff_amount_2010;*/
array pop_diff_amount {10} pop_diff_amount_1920 pop_diff_amount_1930
pop_diff_amount_1940 pop_diff_amount_1950
pop_diff_amount_1960 pop_diff_amount_1970
pop_diff_amount_1980 pop_diff_amount_1990
pop_diff_amount_2000 pop_diff_amount_2010;
array pop_diff_prcnt {10} pop_diff_prcnt_1920 pop_diff_prcnt_1930
pop_diff_prcnt_1940 pop_diff_prcnt_1950
pop_diff_prcnt_1960 pop_diff_prcnt_1970
pop_diff_prcnt_1980 pop_diff_prcnt_1990
pop_diff_prcnt_2000 pop_diff_prcnt_2010;
do i=1 to dim(population_array) - 1;
pop_diff_amount{i} = population_array{i+1} - population_array{i};
pop_diff_prcnt{i} = (population_array{i+1} / population_array{i} -1) * 100;
end;
RUN;
I am still a beginner in SAS, so I am not sure whether this is possible or easy to achieve.
Thanks!
Not automatic, but not all that difficult either. First create a data set of the names, then transpose it, use an unexecuted SET to bring in the names, and then define the arrays. Note how the arrays are defined using [*] and name:, much as you did with population_array.
data names;
do type = 'Amount','Prcnt';
do year=1920 to 2010 by 10;
length _name_ $32;
_name_ = catx('_','pop_diff',type,year);
output;
end;
end;
run;
proc print;
run;
proc transpose data=names out=pop_diff(drop=_name_);
var;
run;
proc contents varnum;
run;
data pop;
set sashelp.us_data(keep=STATENAME POPULATION:);
array population_array {*} POPULATION_1910 -- POPULATION_2010;
if 0 then set pop_diff;
array pop_diff_amount[*] pop_diff_amount:;
array pop_diff_prcnt[*] pop_diff_prcnt:;
do i=1 to dim(population_array) - 1;
pop_diff_amount{i} = population_array{i+1} - population_array{i};
pop_diff_prcnt{i} = (population_array{i+1} / population_array{i} -1) * 100;
end;
run;
proc print data=pop;
run;
SAS always increments a numbered variable list such as var1 - var10 by 1, so it cannot step by decade. Here is an alternative solution that uses one extra step to create a set of macro variables holding the desired variable names. Since you are basing them on the variables POPULATION_<year>, we simply grab the years from those variable names, build the variable names we want for the arrays, and store them in a couple of macro variables.
proc sql noprint;
select cats('pop_diff_amount_', scan(name, -1, '_') )
, cats('pop_diff_prcnt_', scan(name, -1, '_') )
into :pop_diff_amount_vars separated by ' '
, :pop_diff_prcnt_vars separated by ' '
from dictionary.columns
where libname = 'SASHELP'
AND memname = 'US_DATA'
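AND upcase(name) LIKE 'POPULATION_%'
AND scan(name, -1, '_') ne '1910' /* skip the first census year: it has no earlier year to difference against */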
;
quit;
data work.test(keep= statename pop_diff:);
set sashelp.us_data(keep=STATENAME POPULATION:);
array population_array {*} POPULATION_1910 -- POPULATION_2010;
dimp = dim(population_array);
array pop_diff_amount {*} &pop_diff_amount_vars.;
array pop_diff_prcnt {*} &pop_diff_prcnt_vars.;
do i=1 to dim(population_array) - 1;
pop_diff_amount{i} = population_array{i+1} - population_array{i};
pop_diff_prcnt{i} = (population_array{i+1} / population_array{i} -1) * 100;
end;
RUN;
Moving the year out of the metadata (the variable names) and into a variable of its own would make the coding easier.
proc transpose data=sashelp.us_data out=us_pop(rename=(col1=Population));
by statename;
var population_:;
run;
data us_pop;
set us_pop;
by statename;
year = input(scan(_name_,-1,'_'),4.);
pop_diff_amount=dif(population);
pop_diff_prcnt =(population/lag(population))-1;
format pop_diff_prcnt percent10.2;
if first.statename then call missing(of pop_diff_amount pop_diff_prcnt);
drop _:;
run;
proc print data=us_pop(obs=10);
run;

Suppress Subtotal in Proc report

I have a proc report that groups and does subtotals. If I only have one observation in the group, the subtotal is useless. I'd like to either not do the subtotal for that line or not do the observation there. I don't want to go with a line statement, due to inconsistent formatting\style.
Here's some sample data. In the report the Tiki (my cat) line should only have one line, either the obs from the data or the subtotal...
data tiki1;
name='Tiki';
sex='C';
age=10;
height=6;
weight=9.5;
run;
data test;
set sashelp.class tiki1;
run;
It looks like you are trying to do something that proc report cannot achieve in one pass. If, however, you just want the output you describe, here is an approach that does not use proc report.
proc sort data = test;
by sex;
run;
data want;
length sex $10.;
set test end = eof;
by sex;
_tot + weight;
if first.sex then _stot = 0;
_stot + weight;
output;
if last.sex and not first.sex then do;
Name = "";
sex = "Subtotal " || trim(sex);
weight = _stot;
output;
end;
keep sex name weight;
if eof then do;
Name = "";
sex = "Total";
weight = _tot;
output;
end;
run;
proc print data = want noobs;
run;
This method manually creates the subtotals and the total in the dataset by keeping running sums. If you want fancy formatting, you could pass this data through proc report rather than proc print; Joe gives an example here.
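As a rough illustration only (this is not Joe's example, and the bold style is an arbitrary choice), passing WANT through proc report with every column defined as DISPLAY might look like this:

```sas
/* Sketch: print WANT as-is and highlight the precomputed summary rows. */
proc report data=want nowd;
  column sex name weight;
  define sex    / display;
  define name   / display;
  define weight / display;
  compute sex;
    /* rows built in the data step are labelled 'Subtotal ...' or 'Total' */
    if index(sex, 'Subtotal') = 1 or sex = 'Total' then
      call define(_row_, 'style', 'style=[font_weight=bold]');
  endcomp;
run;
```

Because the summary rows are ordinary observations here, they pick up the same column styles as the detail rows, which is what the question was after.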

Indicator variable for maximum value by groups

Is there a more elegant way than the one presented below for the following task:
create indicator variables (below "MAX_X1" and "MAX_X2") within each group (below "key1") of multiple observations (below "key2"), with value 1 if the observation holds the maximum value of the variable in its group and 0 otherwise?
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc means data=have noprint;
by key1;
var x1 x2;
output out=max
max= / autoname;
run;
data want;
merge have max;
by key1;
drop _:;
run;
proc sql;
title "MAX";
select name into :MAXvars separated by ' '
from dictionary.columns
WHERE LIBNAME="WORK" AND MEMNAME="WANT" AND NAME like "%_Max"
order by name;
quit;
title;
data want; set want;
array MAX (*) &MAXvars;
array XVars (*) x1 x2;
array Indicators (*) MAX_X1 MAX_X2;
do i=1 to dim(MAX);
if XVars[i]=MAX[i] then Indicators[i]=1; else Indicators[i]=0;
end;
drop i;
run;
Thanks for any suggestion of optimization
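Proc sql can do this with a group by clause: when a summary function such as max() is used alongside non-grouped columns, SAS remerges the group-level result back onto every row of the group, so each row's value can be compared with its group maximum.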
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
proc sql;
create table want
as select
key1,
key2,
x1,
x2,
case
when x1 = max(x1) then 1
else 0 end as max_x1,
case
when x2 = max(x2) then 1
else 0 end as max_x2
from have
group by key1
order by key1, key2;
quit;
It is also possible to do this in a single data step, provided that you read the input dataset twice - this is an example of a double DOW-loop.
data have;
call streaminit(4321);
do key1=1 to 10;
do key2=1 to 5;
do x1=rand("uniform");
x2=rand("Normal");
output;
end;
end;
end;
run;
/*Sort by key1 (or generate index) if not already sorted*/
proc sort data = have;
by key1;
run;
data want;
if 0 then set have;
array xvars[3,2] x1 x2 x1_max_flag x2_max_flag t_x1_max t_x2_max;
/*1st DOW-loop*/
do _n_ = 1 by 1 until(last.key1);
set have;
by key1;
do i = 1 to 2;
xvars[3,i] = max(xvars[1,i],xvars[3,i]);
end;
end;
/*2nd DOW-loop*/
do _n_ = 1 to _n_;
set have;
do i = 1 to 2;
xvars[2,i] = (xvars[1,i] = xvars[3,i]);
end;
output;
end;
drop i t_:;
run;
This may be a bit complicated to understand, so here's a rough explanation of how it flows:
Read one by-group with the first DOW-loop, updating the rolling max variables as each row is read in. Don't output anything yet.
Now read the same by-group again using the second DOW-loop, checking to see whether each row is equal to the rolling max and outputting each row.
Go back to first DOW-loop, read the next by-group and repeat.
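Stripped down to a single variable and without the two-dimensional array, the same flow can be sketched as follows. The names want_simple and x1_max are made up for illustration, and HAVE is assumed to be sorted by key1.

```sas
data want_simple;
  /* Pass 1: read one by-group and track its maximum.
     x1_max starts as missing on every DATA step iteration, i.e. for every group. */
  do _n_ = 1 by 1 until (last.key1);
    set have;
    by key1;
    x1_max = max(x1_max, x1);
  end;
  /* Pass 2: re-read the same _n_ rows and flag the one(s) equal to the maximum. */
  do _n_ = 1 to _n_;
    set have;
    max_x1 = (x1 = x1_max);
    output;
  end;
  drop x1_max;
run;
```

Each SET statement keeps its own read position in HAVE, which is what lets the second loop revisit exactly the rows the first loop just consumed.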