Calculate growth rates for several variables at the same time - sas

I would like to calculate the growth rate of several variables without doing it manually. Is there any smart way to do it?
E.g. See table below from sashelp.citiyr that looks like:
DATE PAN PAN17 PAN18 PANF PANM
1980 227757 172456 138358 116869 110888
1981 230138 175017 140618 118074 112064
1982 232520 177346 142740 119275 113245
1983 234799 179480 144591 120414 114385
1984 237001 181514 146257 121507 115494
1985 239279 183583 147759 122631 116648
1986 241625 185766 149149 123795 117830
1987 243942 187988 150542 124945 118997
1988 246307 189867 152113 126118 120189
1989 248762 191570 153695 127317 121445
I can create the growth rate of the variables as follows (example for first column PAN) but I would like a way to compute it for all the variables (or those I want, imagine a case with dozens of them).
data test;
set sashelp.citiyr;
by date;
Pan_growth = PAN / lag(PAN);
run;
Any idea how to make this smarter?

Transpose your data set and use BY group processing.
*transpose to a long format;
proc transpose data=have out=long (rename=col1=Value);
by date;
var pan--panm;
run;
*sort for each variable to be consistent;
proc sort data=long;
by _name_ date;
run;
*calculate lag;
data want;
set long;
by _name_;
prev_val = lag(value);
if not(first._name_) then growth = value/prev_value - 1;
run;

You could use macros. Also to avoid get warnings you need some conditional logic to prevent division in first row
data test;
set sashelp.citiyr;
by date;
%macro growth(var);
before=lag(&var);
if _N_>1 then &var._growth = &var. / before;
drop before;
%mend;
%growth(PAN);
%growth(PAN17);
%growth(PAN18);
%growth(PANF);
%growth(PANM);
run;

Use arrays.
data test;
set sashelp.citiyr;
array vars[*] pan pan17 pan18 panf panm;
array growth[*] pan_growth pan17_growth pan18_growth panf_growth panm_growth;
do i = 1 to dim(vars);
growth[i] = vars[i]/lag(vars[i]);
end;
run;
If your variables all start with a certain prefix, end in sequential numbers, or are always in the exact same order, you can save even more time by using variable list shortcuts.
If you have an even more complex case where you have hundreds of variables that aren't in the right order or have no simple pattern, you can generate the desired names and save them into macro lists using SQL and dictionary.columns. Just make sure you exclude any irrelevant variables from your query.
proc sql noprint;
select name
, cats(name, '_growth')
into :vars separated by ' '
, :growth_vars separated by ' '
from dictionary.columns
where libname = 'SASHELP'
AND memname = 'CITIYR'
AND upcase(name) NE 'DATE'
;
quit;
data test2;
set sashelp.citiyr;
array vars[*] &vars.;
array growth[*] &growth_vars.;
do i = 1 to dim(vars);
growth[i] = vars[i]/lag(vars[i]);
end;
run;

Related

SaS 9.4: How to use different weights on the same variable without datastep or proc sql

I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I've to save a huge amount of data and not so much physical space and I should construct like 120 field (a0-a6,b0-b6 etc) that are the same variables just with fixed weight (wgt0-wgt5).
I want to store a dataset with 20 columns (a,b,c..) and 6 weight (wgt0-wgt5) and, on demand, processing a "summary" without an intermediate datastep that oblige me to create 120 fields.
Due to the huge amount of data (more or less 55Gb every month) I'd like also not to use proc sql statement:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
There is a "Super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');;
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the key;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate
with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;[enter image description here][1]
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
enter image description here

SAS - creating indicator variables

I'm using SAS and I'd like to create an indicator variable.
The data I have is like this (DATA I HAVE):
and I want to change this to (DATA I WANT):
I have a fixed number of total time that I want to use, and the starttime has duplicate time value (in this example, c1 and c2 both started at time 3). Although the example I'm using is small with 5 names and 12 time values, the actual data is very large (about 40,000 names and 100,000 time values - so the outcome I want is a matrix with 100,000x40,000.)
Can someone please provide any tips/solution on how to handle this?
40k variables is a lot. It will be interesting to see how well this scales. How do you determine the stop time?
data have;
input starttime name :$32.;
retain one 1;
cards;
1 varx
3 c1
3 c2
5 c3x
10 c4
11 c5
;;;;
run;
proc print;
run;
proc transpose data=have out=have2(drop=_name_ rename=(starttime=time));
by starttime;
id name;
var one;
run;
data time;
if 0 then set have2(drop=time);
array _n[*] _all_;
retain _n 0;
do time=.,1 to 12;
output;
call missing(of _n[*]);
end;
run;
data want0 / view=want0;
merge time have2;
by time;
retain dummy '1';
run;
data want;
length time 8;
update want0(obs=0) want0;
by dummy;
if not missing(time);
output;
drop dummy;
run;
proc print;
run;
This will work. There may be a simpler solution that does it all in one data step. My data step creates a staggered results that has to be collapsed which I do by summing in the sort/means.
data have;
input starttime name $;
datalines;
3 c1
3 c2
5 c3
10 c4
11 c5
;
run;
data want(drop=starttime name);
set have;
array cols (*) c1-c5;
do time=1 to 100;
if starttime < time then cols(_N_)=1;
else cols(_N_)=0;
output;
end;
run;
proc sort data=want;
by time;
proc means data=want noprint;
by time;
var _numeric_;
output out=want2(drop=_type_ _freq_) sum=;
run;
I am not recommending you do it this way. You didn't provide enough information to let us know why you want a matrix of that size. You may have processing issues getting it to run.
In the line do time=1 to 100 you can change that to 100000 or whatever length.
I think the code below will work:
%macro answer_macro(data_in, data_out);
/* Deduplication of initial dataset just to assure that every variable has a unique starting time*/
proc sort data=&data_in. out=data_have_nodup; by name starttime; run;
proc sort data=data_have_nodup nodupkey; by name; run;
/*Getting min and max starttime values - here I am assuming that there is only integer values form starttime*/
proc sql noprint;
select min(starttime)
,max(starttime)
into :min_starttime /*not used. Use this (and change the loop on the next dataset) to start the time variable from the value where the first variable starts*/
,:max_starttime
from data_have_nodup
;quit;
/*Getting all pairs of name/starttime*/
proc sql noprint;
select name
,starttime
into :name1 - :name1000000
,:time1 - :time1000000
from data_have_nodup
;quit;
/*Getting total number of variables*/
proc sql noprint;
select count(*) into :nvars
from data_have_nodup
;quit;
/* Creating dataset with possible start values */
/*I'm not sure this step could be done with a single datastep, but I don't have SAS
on my PC to make tests, so I used the method below*/
data &data_out.;
do i = 1 to &max_starttime. + 1;
time = i; output;
end;
drop i;
run;
data &data_out.;
set &data_out.;
%do i = 1 %to &nvars.;
if time >= &&time&i then &&name&i = 1;
else &&name&i = 0;
%end;
run;
%mend answer_macro;
Unfortunately I don't have SAS on my machine right now, so I can't confirm that the code works. But even if it doesn't, you can use the logic in it.

SAS function to every observaton (finance xirr)

I have an sql table like this one
id | payment | date |
______|_____________|________________________|
obs1 | -20,10,13 | 21184,22765,22704 |
And so on (1M+ observation). I prepeared all the data for using finance() in SQL, so in SAS i just need to take them and pass to the function. I am confident, that the data i prepared will return right answer
The problem is that i can't find the most proper way to do caclulate the function on entire data. Right now i am going row by row in cycle and passing data to macro variables throught proc sql BUT i can't get string larger than 1000 characters, so my program isn't working.
I am running next function:
finance('XIRR', payment, date, 0.15);
Can you help me please? Thanks
The code i had before the answer. Worked unacceptable long!
%macro eir (input_data, cash_var, dt_var, output_data);
data rawdata;
set &input_data(dbmax_text=32000);
run;
proc sql noprint;
select count(*) into :n from rawdata ;
quit;
%let n = 100;
%do j=1 %to &n;
data x;
set rawdata(firstobs = &j obs= &j);
run;
proc sql noprint;
select &cash_var into: cf from x;
select &dt_var into: dt from x;
quit;
data x;
set x;
r= finance('xirr', &cf, &dt, 0.15);
drop &cash_var &dt_var;
run;
data out;
set %if &j>1 %then %do; out %end; x;
run;
%end;
proc append base = &output_data data=out;
run;
proc datasets nolist;
delete x out rawdata;
run;
%mend eir;
%eir(input_data = have, cash_var = pmt, dt_var = dt, output_data = ggg);
Took 20 minutes to calculate 50,000 rows
and now it's just
data want;
set have(dbmax_text=32000);
eir = input(resolve(catx(',','%sysfunc(finance(XIRR',pmt,dt,'0.15),hex16)')),hex16.);
run;
Took 6 minutes to calcuate 1,400,000 rows
Tom just saved our project =)
The FINANCE() function wants a list of values, not a character string. You could parse the string and convert the text back into numbers and pass those to the function. But if the lengths of the lists vary from observation to observation that will cause issues.
You could use the macro processor to help you. You can generate a call to %sysfunc(finance()) and read the generated string back into a numeric variable.
It also might work to pad the short lists with zero payments on the last recorded date.
Let's make some test data.
data have ;
infile cards dsd dlm='|' ;
length id $20 payment date $100 ;
input id payment date;
cards;
obs1 | -20,10,13 | 21184,22765,22704
obs2 | -20,10 | 21184,22765
;
Now let's try converting it two ways. One by creating numeric variables to pass to the FINANCE() function call and the other by generating %sysfunc(finance()) call so that we can make sure the %sysfunc() call is working properly.
data want;
set have ;
array v (3) _temporary_;
array d (3) _temporary_;
do i=1 to dim(v);
v(i)=coalesce(input(scan(payment,i,','),32.),0);
d(i)=input(scan(date,i,','),32.);
if missing(d(i)) and i>1 then d(i)=d(i-1);
end;
drop i;
value1=finance('XIRR',of v(*),of d(*),0.15);
value2=input(resolve(catx(',','%sysfunc(finance(XIRR',payment,date,'0.15),hex16)')),hex16.);
run;
Here's my best guess based on the limited details you've provided. I think you need to split out each date and payment into separate variables before you can call the finance function, e.g.:
data have;
infile datalines dlm='|';
input id :$8. amount :$20. date :$20.;
datalines;
obs1 | -20,10,13 | 21184,22765,22704
;
run;
data want;
set have;
array dates[3] d1-d3;
array amounts[3] a1-a3;
do i = 1 to 3;
amounts[i] = input(scan(amount, i, ','), 8.);
dates[i] = input(scan(date, i, ','), 8.);
end;
XIRR = finance('XIRR', of a1-a3, of d1-d3, 0.15);
run;
I suspect this will only work you have the same number of dates and payments in every row, otherwise you will run into array out of bounds issues or problems with the IRR calculation.

How can I extract the unique values of a variable and their counts in SAS

Suppose I have these data read into SAS:
I would like to list each unique name and the number of months it appeared in the data above to give a data set like this:
I have looked into PROC FREQ, but I think I need to do this in a DATA step, because I would like to be able to create other variables within the new data set and otherwise be able to manipulate the new data.
Data step:
proc sort data=have;
by name month;
run;
data want;
set have;
by name month;
m=month(lag(month));
if first.id then months=1;
else if month(date)^=m then months+1;
if last.id then output;
keep name months;
run;
Pro Sql:
proc sql;
select distinct name,count(distinct(month(month))) as months from have group by name;
quit;
While it's possible to do this in a data step, you wouldn't; you'd use proc freq or similar. Almost every PROC can give you an output dataset (rather than just print to the screen).
PROC FREQ data=sashelp.class;
tables age/out=age_counts noprint;
run;
Then you can use this output dataset (age_counts) as a SET input to another data step to perform your further calculations.
You can also use proc sql to group the variable and count how many are in that group. It might be faster than proc freq depending on how large your data is.
proc sql noprint;
create table counts as
select AGE, count(*) as AGE_CT from sashelp.class
group by AGE;
quit;
If you want to do it in a data step, you can use a Hash Object to hold the counted values:
data have;
do i=1 to 100;
do V = 'a', 'b', 'c';
output;
end;
end;
run;
data _null_;
set have end=last;
if _n_ = 1 then do;
declare hash cnt();
rc = cnt.definekey('v');
rc = cnt.definedata('v','v_cnt');
rc = cnt.definedone();
call missing(v_cnt);
end;
rc = cnt.find();
if rc then do;
v_cnt = 1;
cnt.add();
end;
else do;
v_cnt = v_cnt + 1;
cnt.replace();
end;
if last then
rc = cnt.output(dataset: "want");
run;
This is very efficient as it is a single loop over the data. The WANT data set contains the key and count values.

SAS: Creating dummy variables from categorical variable

I would like to turn the following long dataset:
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
Into a wide dataset that looks like this:
ID Ankle Shoulder Head
1 1 1 0
2 1 0 1
3 0 1 1'
This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types):
Creating dummy variables from multiple strings in the same row
Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140
Thanks for the help!
Here's a basic method that should work easily, even with several million records.
First you sort the data, then add in a count to create the 1 variable. Next you use PROC TRANSPOSE to flip the data from long to wide. Then fill in the missing values with a 0. This is a fully dynamic method, it doesn't matter how many different Injury types you have or how many records per person. There are other methods that are probably shorter code, but I think this is simple and easy to understand and modify if required.
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
proc sort data=test;
by id injury;
run;
data test2;
set test;
count=1;
run;
proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;
data want;
set want;
array inj(*) injury_:;
do i=1 to dim(inj);
if inj(i)=. then inj(i) = 0;
end;
drop _name_ i;
run;
Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).
First, create a macro variable containing the list of injuries
proc sql noprint;
select distinct injury
into :injuries separated by " "
from have
order by injury;
quit;
Then, let RETAIN do the magic -- no transposition needed!
data want(drop=i injury);
set have;
by id;
format &injuries 1.;
retain &injuries;
array injuries(*) &injuries;
if first.id then do i = 1 to dim(injuries);
injuries(i) = 0;
end;
do i = 1 to dim(injuries);
if injury = scan("&injuries",i) then injuries(i) = 1;
end;
if last.id then output;
run;
EDIT
Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a sas dataset.
1 - Define Labels:
data myLabels;
infile datalines dlm="|" truncover;
informat injury $12. labl $24.;
input injury labl;
datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;
2 - Add a new query to the existing proc sql step to prepare the label assignment.
proc sql noprint;
/* Existing query */
select distinct injury
into :injuries separated by " "
from have
order by injury;
/* New query */
select catx("=",injury,quote(trim(labl)))
into :labls separated by " "
from myLabels;
quit;
3 - Then, at the end of the data want step, just add a label statement.
data want(drop=i injury);
set have;
by id;
/* ...same as before... */
* Add labels;
label &labls;
run;
And that should do it!