Attempting to Automatically Create Label of Years Used in an Analysis - sas

Say I had 5 years of data that were being used to calculate some measure across those aggregated years. Sometimes those are 5 consecutive years and other times data was not available for a given year so it must be skipped. For example 2016-2020 vs 2015-2017 & 2019-2020. In this case data was not available for 2018. I have been given a set of rules for how these years should be presented.
Consecutive years should be shown as a range, e.g. 2016-2020
Non-consecutive years will look slightly different depending on where the missing year(s) occur:
2015-2017 & 2019-2020
2010, 2012, 2014, 2016 & 2017
2015-2018 & 2020
While it would be trivial just to produce a comma separated list of all years used this is how they want the years presented. These labels are for a series of different measures so I am attempting to create these labels automatically within a macro. The number of years of data is also not always 5. It could be 3 years or even 10 years.
The obvious first idea was a do until process that started at the minimum year and progressively compared it against the next year used in the analysis to see whether the years were consecutive. Given that the number of years isn't consistently 5, this is what has made the most sense so far, but I have not worked with do until loops very much. As such I couldn't quite figure out how to progressively build the label over the iterations of the do until loop while also adhering to these rules.
For this example let's use the years 2015,2016,2017,2019,2020.
Any help would be greatly appreciated.

This could be a case of a picture is worth a thousand words.
Example:
/* simulate raw results of a survey of 10 questions over 16 years */
data surveyresults;
call streaminit(20230125);
do qid = 1 to 10;
do year = 2007 to 2022;
if year = 2021 then continue;
if rand('uniform') > 0.85 then continue;
do _n_ = 1 to rand('integer', 30);
pid + 1;
if rand('uniform') > 0.85 then continue;
answercode = rand('integer', 20);
output;
end;
end;
end;
run;
proc sql noprint;
create table stage1 as
select distinct qid, year, 1 as flag
from surveyresults
order by qid, year
;
select catx(' ', min(year), 'to', max(year))
into :year_range
from stage1;
quit;
ods html file='plot.html';
proc sgplot data=stage1;
scatter x=year y=qid / markerattrs=(symbol=squarefilled size=8.2%);
xaxis values=(&year_range);
yaxis type=discrete;
run;
ods html close;

This should get you started.
data test;
infile cards dsd;
input x @@;
d = dif(x); /*used to create RUN when dif > 1 increment run*/
if d eq . or d > 1 then run+1;
cards;
2015,2016,2017,2019,2020,2022,2024,2025,2026
;;;;
run;
proc print;
run;
proc summary data=test nway; /*count the number of years in each run*/
class run;
output out=runlen(drop=_type_);
run;
data test; /* merge TEST and RUNLEN*/
length list $128;
do until(last.run); /*loop until last.run*/
merge test runlen;
by run;
if first.run then list = cats(x); /*start of list*/
end;
select(_freq_); /*based on run-length create LIST */
when(1);
when(2) list = catx(' & ',list,x);
otherwise list = catx('-',list,x);
end;
run;
proc print;
run;

Probably an easier way than this, but this works for your scenarios.
data years;
input year;
cards;
2015
2016
2017
2019
2020
;
run;
/* data years; */
/* input year; */
/* cards; */
/* 2010 */
/* 2012 */
/* 2014 */
/* 2016 */
/* 2017 */
/* ; */
/* run; */
/* data years; */
/* input year; */
/* cards; */
/* 2015 */
/* 2016 */
/* 2017 */
/* 2018 */
/* 2020 */
/* ; */
/* run; */
data want;
merge years years(firstobs=2 rename=(year=next_year)) end=eof;
length year_list $200 interval $20;
retain year_list start_year;
_dif= next_year - year;
if _n_=1 then start_year=year;
if _dif > 1 or eof then do;
if start_year ne year then interval = catx('-', start_year, year);
else interval = put(start_year, 8. -l);
if eof then year_list=catx(" & ", year_list, interval);
else year_list = catx(", ", year_list, interval);
start_year = next_year;
end;
if eof then call symputx('year_list', year_list);
run;
%put &year_list;

This version creates the combined list. I think it has the features you describe.
data test;
infile cards dsd;
input x @@;
d = dif(x); /*used to create RUN when dif > 1 increment run*/
if d eq . or d > 1 then run+1;
cards;
2015,2016,2017,2019,2020,2022,2024,2025,2026
;;;;
run;
proc print;
run;
data list(keep=combinedlist);
length list $128 combinedList $256;
do while(not eof);
list=' ';
do runlength=1 by 1 until(last.run); /*loop until last.run*/
set test end=eof;
by run;
if first.run then list = cats(x); /*start of list*/
end;
select(runlength); /*based on run-length create LIST */
when(1);
when(2) list = catx(' & ',list,x);
otherwise list = catx('-',list,x);
end;
combinedList = catx(', ',combinedList,list);
end;
output;
stop;
run;
proc print;
run;
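For the DO UNTIL idea described in the question, here is a minimal sketch that works directly from a comma-separated macro variable. The names `years`, `year_label`, `run_start`, `run_end` and `segment` are made up for illustration. It assumes the years are unique and sorted ascending, and that every run of two or more consecutive years is written as start-end.

```sas
/* Hypothetical input: sorted, unique years separated by commas */
%let years = 2015,2016,2017,2019,2020;

data _null_;
  length year_label $200 segment $20;
  n = countw("&years", ',');
  i = 1;
  do until (i > n);
    run_start = input(scan("&years", i, ','), 8.);
    run_end   = run_start;
    /* extend the current run while the next year is exactly one later */
    do while (i < n);
      next_year = input(scan("&years", i + 1, ','), 8.);
      if next_year ne run_end + 1 then leave;
      run_end = next_year;
      i = i + 1;
    end;
    /* a single year stands alone, a longer run becomes start-end */
    if run_start = run_end then segment = cats(run_start);
    else segment = catx('-', run_start, run_end);
    /* the final segment is joined with ' & ', earlier ones with ', ' */
    if i = n then year_label = catx(' & ', year_label, segment);
    else year_label = catx(', ', year_label, segment);
    i = i + 1;
  end;
  call symputx('year_label', year_label);
run;

%put &year_label;
```

Note that this always hyphenates a two-year run, so it would write 2016-2017 rather than the "2016 & 2017" shown in the second example of the question. That case would need an extra rule.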

Related

how to vertically sum a range of dynamic variables in sas?

I have a dataset in SAS in which the months would be dynamically updated each month. I need to calculate the sum vertically each month and paste the sum below, as shown in the image.
Proc means/ proc summary and proc print are not doing the trick for me.
I was given the following code before:
%let month = month name;
%put &month.;
data new_totals;
set Final_&month. end=end;
&month._sum + &month._final;
/*feb_sum + &month._final;*/
output;
if end then do;
measure = 'Total';
&month._final = &month._sum;
/*Feb_final = feb_sum;*/
output;
end;
drop &month._sum;
run;
The problem is this has all the months hardcoded, which I don't want. I am not too familiar with loops or arrays, so I need a solution for this, please.
It may be better to use a reporting procedure such as PRINT or REPORT to produce the desired output.
data have;
length group $20;
do group = 'A', 'B', 'C';
array month_totals jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 nov2019 dec2019;
do over month_totals;
month_totals = 10 + floor(60 * rand('uniform')); /* random integer from 10 to 69 */
end;
output;
end;
run;
ods excel file='data_with_total_row.xlsx';
proc print noobs data=have;
var group ;
sum jan2020--dec2019;
run;
proc report data=have;
columns group jan2020--dec2019;
define group / width=20;
rbreak after / summarize;
compute after;
group = 'Total';
endcomp;
run;
ods excel close;
Data structure
The data sets you are working with are 'difficult' because the date aspect of the data is actually in the metadata, i.e. the column name. An even better approach, in SAS, is to have categorical data with the columns
group (categorical role)
month (categorical role)
total (continuous role)
Such data can be easily filtered with a where clause, and reporting procedures such as REPORT and TABULATE can use the month variable in a class statement.
Example:
data have;
length group $20;
do group = 'A', 'B', 'C';
do _n_ = 0 by 1 until (month >= '01feb2020'd);
month = intnx('month', '01jan2018'd, _n_);
total = 10 + floor(60 * rand('uniform')); /* random integer from 10 to 69 */
output;
end;
end;
format month monyy5.;
run;
proc tabulate data=have;
class group month;
var total;
table
group all='Total'
,
month='' * total='' * sum=''*f=comma9.
;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;
proc report data=have;
column group (month,total);
define group / group;
define month / '' across order=data ;
define total / '' ;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;
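To get from the wide layout (one column per month) to the long layout described above, PROC TRANSPOSE is one option. The following is only a sketch: the data set `wide`, its three month columns, and the names `month_name` and `long` are made up for illustration. It assumes every numeric column is a month named like jan2020.

```sas
/* Hypothetical wide input with one column per month */
data wide;
  input group $ jan2020 feb2020 mar2020;
  datalines;
A 10 20 30
B 15 25 35
;
run;

proc sort data=wide;
  by group;
run;

/* One row per group and month; MONTH_NAME holds the original column name */
proc transpose data=wide out=long(rename=(col1=total)) name=month_name;
  by group;
  var _numeric_;   /* picks up whatever month columns exist, nothing hardcoded */
run;

/* Turn the column name into a SAS date so it sorts chronologically */
data long;
  set long;
  month = input(cats('01', month_name), date9.);
  format month monyy7.;
run;
```

Because `var _numeric_` is used, new month columns are picked up without editing the code, and the result can be fed to the TABULATE or REPORT steps shown above.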
Here is a basic way. Borrowed sample data from Richard.
data have;
length group $20;
do group = 'A', 'B';
array months jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 nov2019 dec2019;
do over months;
months = 10 + floor(60 * rand('uniform')); /* random integer from 10 to 69 */
end;
output;
end;
run;
proc summary data=have;
var _numeric_;
output out=temp(drop=_:) sum=;
run;
data want;
set have temp (in=t);
if t then group='Total';
run;

SAS - creating indicator variables

I'm using SAS and I'd like to create an indicator variable.
The data I have is like this (DATA I HAVE):
and I want to change this to (DATA I WANT):
I have a fixed number of total time that I want to use, and the starttime has duplicate time value (in this example, c1 and c2 both started at time 3). Although the example I'm using is small with 5 names and 12 time values, the actual data is very large (about 40,000 names and 100,000 time values - so the outcome I want is a matrix with 100,000x40,000.)
Can someone please provide any tips/solution on how to handle this?
40k variables is a lot. It will be interesting to see how well this scales. How do you determine the stop time?
data have;
input starttime name :$32.;
retain one 1;
cards;
1 varx
3 c1
3 c2
5 c3x
10 c4
11 c5
;;;;
run;
proc print;
run;
proc transpose data=have out=have2(drop=_name_ rename=(starttime=time));
by starttime;
id name;
var one;
run;
data time;
if 0 then set have2(drop=time);
array _n[*] _all_;
retain _n 0;
do time=.,1 to 12;
output;
call missing(of _n[*]);
end;
run;
data want0 / view=want0;
merge time have2;
by time;
retain dummy '1';
run;
data want;
length time 8;
update want0(obs=0) want0;
by dummy;
if not missing(time);
output;
drop dummy;
run;
proc print;
run;
This will work. There may be a simpler solution that does it all in one data step. My data step creates staggered results that have to be collapsed, which I do by summing in the sort/means steps.
data have;
input starttime name $;
datalines;
3 c1
3 c2
5 c3
10 c4
11 c5
;
run;
data want(drop=starttime name);
set have;
array cols (*) c1-c5;
do time=1 to 100;
if starttime <= time then cols(_N_)=1; /* indicator is 1 from the start time onward */
else cols(_N_)=0;
output;
end;
run;
proc sort data=want;
by time;
proc means data=want noprint;
by time;
var _numeric_;
output out=want2(drop=_type_ _freq_) sum=;
run;
I am not recommending you do it this way. You didn't provide enough information to let us know why you want a matrix of that size. You may have processing issues getting it to run.
In the line do time=1 to 100 you can change that to 100000 or whatever length.
I think the code below will work:
%macro answer_macro(data_in, data_out);
/* Deduplication of initial dataset just to assure that every variable has a unique starting time*/
proc sort data=&data_in. out=data_have_nodup; by name starttime; run;
proc sort data=data_have_nodup nodupkey; by name; run;
/*Getting min and max starttime values - here I am assuming that there are only integer values for starttime*/
proc sql noprint;
select min(starttime)
,max(starttime)
into :min_starttime /*not used. Use this (and change the loop on the next dataset) to start the time variable from the value where the first variable starts*/
,:max_starttime
from data_have_nodup
;quit;
/*Getting all pairs of name/starttime*/
proc sql noprint;
select name
,starttime
into :name1 - :name1000000
,:time1 - :time1000000
from data_have_nodup
;quit;
/*Getting total number of variables*/
proc sql noprint;
select count(*) into :nvars
from data_have_nodup
;quit;
/* Creating dataset with possible start values */
/*I'm not sure this step could be done with a single datastep, but I don't have SAS
on my PC to make tests, so I used the method below*/
data &data_out.;
do i = 1 to &max_starttime. + 1;
time = i; output;
end;
drop i;
run;
data &data_out.;
set &data_out.;
%do i = 1 %to &nvars.;
if time >= &&time&i then &&name&i = 1;
else &&name&i = 0;
%end;
run;
%mend answer_macro;
Unfortunately I don't have SAS on my machine right now, so I can't confirm that the code works. But even if it doesn't, you can use the logic in it.

stacking data in SAS

I am trying to rearrange my data but am having a difficult time. The data I have looks something like this:
date a b c
====================
1996 5 7 8
1997 4 2 3
1998 1 9 6
what I want is to rearrange the data (presumably using arrays) to get this:
date val var
=============
1996 5 a
1997 4 a
1998 1 a
1996 7 b
1997 2 b
1998 9 b
1996 8 c
1997 3 c
1998 6 c
So that I've essentially stacked the variables (a,b,c) along with the corresponding date and name of the variable.
Thanks in advance!
Use PROC TRANSPOSE to pivot the data.
First sort by DATE
proc sort data=have;
by date;
run;
Then use transpose
proc transpose data=have out=want(rename=(COL1=VAL _NAME_=VAR));
by date;
var a b c;
run;
Finally, it looks like you want this sorted by VAR, and then DATE
proc sort data=want;
by VAR date;
run;
Since you mention arrays, here's how you would achieve the result using them.
I would, however, use the proc transpose method in the answer from @DomPazz, as procedures are generally easier for others to read and understand when they need to look at the code.
/* create initial dataset */
data have;
input date a b c;
datalines;
1996 5 7 8
1997 4 2 3
1998 1 9 6
;
run;
/* transpose data */
data want;
set have;
array vars{*} a b c; /* create array of required values */
length val 8 var $8; /* set lengths of new variables */
do i = 1 to dim(vars); /* loop through each element of the array */
val = vars{i}; /* set val to be current array value */
var = vname(vars{i}); /* set var to be name of current array variable name */
drop a b c i; /* drop variables not required */
output; /* output each value to a new row */
end;
run;
/* sort data in required order */
proc sort data=want;
by var date;
run;

Calculate maximum difference between grouped rows

I have the following data where people in households are sorted by age (oldest to youngest):
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
I would like to calculate for each household the maximum age difference between consecutively aged people. So this example would give values of 5 (=25-20), 16 (=32-16) and 32 (=42-10) for households 1, 2 and 3 respectively.
I could do this using lots of merges (i.e. extract person 1, merge with extract of person 2, and so on), but as there can be up to 20+ people in a household I'm looking for a much more direct method.
Here's a two pass solution. Same first step as the two solutions above, sort by age. In the second step keep track of max_diff per row, at the last record of HouseID output the results. This results in only two passes through the data.
proc sort data=houses; by houseid descending age; run; /* oldest first, so dif1(age)*-1 is positive */
data want;
set houses;
by houseID;
retain max_diff 0;
diff = dif1(age)*-1;
if first.HouseID then do;
diff = .; max_diff=.;
end;
if diff>max_diff then max_diff=diff;
if last.houseID then output;
keep houseID max_diff;
run;
proc sort data=houses; by houseid personid age;run;
data _t1;
set houses;
diff = dif1(age) * (-1);
if personid = 1 then diff = .;
run;
proc sql;
create table want as
select houseid, max(diff) as Max_Diff
from _t1
group by houseid;
quit;
proc sort data = houses out = house;
by houseid descending age;
run;
data house;
set house;
by houseid;
lag_age = lag1(age);
age_diff = lag_age - age;
if first.houseid then age_diff = 0; /* no previous person within this household */
run;
proc sql;
select houseid,max(age_diff) as max_age_diff
from house
group by houseid;
quit;
Working:
First sort the data set using houseid and descending Age.
Second data step will calculate difference between current age value (in PDV) and previous age value in PDV. Then, using sql procedure, we can get the max age difference for each houseid.
Just throwing one more into the mix. This one is a condensed version of Reeza's response.
/* No need to sort by PersonID as age is the only concern */
proc sort data = houses;
by HouseID Age;
run;
data want;
set houses;
by HouseID;
/* Keep the diff when a new row is loaded */
retain diff;
/* Only replace the diff if it is larger than previous */
diff = max(diff, abs(dif(Age)));
/* Reset diff for each new house */
if first.HouseID then diff = 0;
/* Only output the final diff for each house */
if last.HouseID;
keep HouseID diff;
run;
Here is an example using FIRST. and LAST. with one pass (after sort) through the data.
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
Proc sort data=HOUSES;
by houseid descending age ;
run;
Data WANT(keep=houseid max_diff);
format houseid max_diff;
retain max_diff age1 age2;
Set HOUSES;
by houseid descending age ;
if first.houseid and last.houseid then do;
max_diff=0;
output;
end;
else if first.houseid then do;
call missing(max_diff,age1,age2);
age1=age;
end;
else if not(first.houseid or last.houseid) then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
age1=age;
end;
else if last.houseid then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
output;
end;
Run;

how to swap first and last observations in sas data set if it contain 3 observations

I have a data set with 3 observations:
1 2 3
4 5 6
7 8 9
Now I have to interchange 1 2 3 and 7 8 9.
How can I do this in Base SAS?
If you just want to sort your dataset by a variable in descending order, use proc sort:
data example;
input number;
datalines;
123
456
789
;
run;
proc sort data = example;
by descending number;
run;
If you want to re-order a dataset in a more complex way, create a new variable containing the position that you want each row to be in, and then sort it by that variable.
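As a minimal sketch of that position-variable idea, the following swaps the first and last rows of sashelp.class. The data set names `keyed` and `swapped` and the variable `position` are made up for illustration.

```sas
/* Assign each row the position it should end up in */
data keyed;
  set sashelp.class nobs=nobs;
  if _n_ = 1 then position = nobs;        /* first row moves to the end */
  else if _n_ = nobs then position = 1;   /* last row moves to the front */
  else position = _n_;                    /* all other rows stay in place */
run;

/* Sort on the position key, then drop it */
proc sort data=keyed out=swapped(drop=position);
  by position;
run;
```

This rewrites and sorts the whole data set, so it is simple rather than fast.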
If you want to swap the contents of the first and last observations while leaving the rest of the dataset in place, you could do something like this.
data class;
set sashelp.class;
run;
data firstobs;
i = 1;
set sashelp.class(obs = 1);
run;
data lastobs;
i = nobs;
set sashelp.class nobs = nobs point = nobs;
output;
stop;
run;
data transaction;
set lastobs firstobs;
/*Swap the values of i for first and last obs*/
retain _i;
if _n_ = 1 then do;
_i = i;
i = 1;
end;
if _n_ = 2 then i = _i;
drop _i;
run;
data class;
set transaction(keep = i);
modify class point = i;
set transaction;
run;
This modifies just the first and last observations, which should be quite a bit faster than sorting or replacing a large dataset. You can do a similar thing with the update statement, but that only works if your dataset is already sorted / indexed by a unique key.
By Sandeep Sharma:sandeep.sharmas091#gmail.com
data testy;
input a;
datalines;
1
2
3
4
5
6
7
8
9
;
run;
data ghj;
drop y;
do i=nobs-2 to nobs;
set testy point=i nobs=nobs;
output;
end;
do n=4 to nobs-3;
set testy point=n;
output;
end;
do y=1 to 3;
set testy;
output;
end;
stop;
run;