Compute growth rate, improvements over PROC EXPAND

I have a SAS dataset, sorted, which has two columns: PERIOD and MYMETRIC
For each row, I want to compute the growth rate over the 4 preceding periods, using a linear regression. So the formula is basically
GROWTH_RATE = Cov([MYMETRIC_lag_4, MYMETRIC_lag_3, MYMETRIC_lag_2, MYMETRIC_lag_1], [1,2,3,4]) / Var([1,2,3,4])
I can do this in SAS with a PROC EXPAND to compute the lags, then a data step to compute the growth rate. Is there a shorter way to do this? In particular, if I later decide to use 8 points instead of 4, I want to minimize the rework.

You can do this in a data step entirely. This assumes you're asking for the four rows preceding the current one. Since [1,2,3,4] is just the time index of the four lags, the growth rate is the OLS slope from regressing the lagged values on that index, which you can compute inline:
%let numlags=4;
data want;
    set have;
    array lags[&numlags] _temporary_; *temporary arrays are retained across rows!;
    *OLS slope = Cov(lags, index)/Var(index), computed once all lags are filled;
    if n(of lags[*]) = dim(lags) then do;
        _ybar = mean(of lags[*]);
        _tbar = (dim(lags) + 1) / 2;
        _num = 0; _den = 0;
        do _t = 1 to dim(lags);
            _num = _num + (_t - _tbar) * (lags[_t] - _ybar);
            _den = _den + (_t - _tbar)**2;
        end;
        growth_rate = _num / _den;
    end;
    *shift the lags and append the current value;
    do _t = 1 to dim(lags) - 1;
        lags[_t] = lags[_t+1];
    end;
    lags[dim(lags)] = MYMETRIC;
    drop _:;
run;
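If you later decide to use 8 points instead of 4, only the macro variable needs to change; the array bounds, the slope formula, and the shifting loop all key off dim(lags):
%let numlags=8;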

Summing Multiple lags in SAS using LAG

I'm trying to write a data step that creates a column holding the sum of the last ten, fifteen, twenty, and forty-five lagged values. What I have below works for ten, but it is not practical to write this code out for the twenty and forty-five lag sums. I'm new to SAS and can't find a good way to write the code. Any help would be greatly appreciated.
Here's what I have:
data averages;
set work.cuts;
sum_lag_ten = (lag10(col) + lag9(col) + lag8(col) + lag7(col) + lag6(col) + lag5(col) + lag4(col) + lag3(col) + lag2(col) + lag1(col));
run;
PROC EXPAND allows easy calculation of moving statistics.
Technically it requires a time component, but if you don't have one you can make one up; just make sure it's consecutive. A row number works fine.
Given this, I'm not sure it's less code, but it's easier to read and type, and if you're calculating for multiple variables it's much more scalable.
TRANSFORMOUT= specifies the transformation; in this case, a moving sum with a window of 10 periods. TRIMLEFT/TRIMRIGHT can be used to ensure that only records with a full 10-period window get a value.
You may need to tweak these depending on what exactly you want. The examples section of the PROC EXPAND documentation (the third example in particular) covers the moving-window transformations.
data have;
    set have;
    rownum = _n_;
run;

proc expand data=have out=want;
    id rownum;
    convert col=col_lag10 / transformout=(movsum 10 trimleft 9);
run;
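Scaling up to the other windows is then just more CONVERT statements. A sketch; the leading LAG 1 is my assumption that you want the sum of the prior rows excluding the current one (matching your lag1-lag10 code), and you should check the TRIM counts against your data:

proc expand data=have out=want;
    id rownum;
    convert col=sum_lag_10 / transformout=(lag 1 movsum 10 trimleft 10);
    convert col=sum_lag_15 / transformout=(lag 1 movsum 15 trimleft 15);
    convert col=sum_lag_45 / transformout=(lag 1 movsum 45 trimleft 45);
run;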
Documentation (SAS/ETS 14.1):
http://support.sas.com/documentation/cdl/en/etsug/68148/HTML/default/viewer.htm#etsug_expand_examples04.htm
If you must do this in the datastep (and if you do things like this regularly, SAS/ETS has better tools for sure), I would do it like this.
data want;
    set sashelp.steel;
    array lags[20];
    retain lags1-lags20;

    *move everything up one;
    do _i = dim(lags) to 2 by -1;
        lags[_i] = lags[_i-1];
    end;

    *assign the current record value;
    lags[1] = steel;

    *now calculate sums;
    *if you want only earlier records and NOT this record, then use lags2-lags11, or do the sum before the "move everything up one" step;
    lag_sum_10 = sum(of lags1-lags10);
    lag_sum_15 = sum(of lags1-lags15); *etc.;
run;
Note: this is not the best solution (a hash table is arguably better), but it is approachable for an intermediate-level programmer because it only uses data step variables.
I don't use a temporary array here because the sums rely on variable-list shortcuts (lags1-lags10); temporary array elements have no names, so those shortcuts aren't available and you can only sum the whole array with [*]. See the sketch below for a loop-based workaround.
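That said, if you do prefer a temporary array (which avoids carrying 20+ retained lag variables into the output), you can accumulate the sub-range sums in an explicit loop instead of using name-range lists. A minimal sketch against SASHELP.STEEL:

data want;
    set sashelp.steel;
    array lags[45] _temporary_;   *temporary arrays are retained automatically;

    *shift everything up one and store the current value;
    do _i = dim(lags) to 2 by -1;
        lags[_i] = lags[_i-1];
    end;
    lags[1] = steel;

    *accumulate the sub-range sums in one pass; SUM() ignores missing lags;
    lag_sum_10 = 0; lag_sum_15 = 0; lag_sum_45 = 0;
    do _i = 1 to dim(lags);
        if _i <= 10 then lag_sum_10 = sum(lag_sum_10, lags[_i]);
        if _i <= 15 then lag_sum_15 = sum(lag_sum_15, lags[_i]);
        lag_sum_45 = sum(lag_sum_45, lags[_i]);
    end;
    drop _i;
run;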

Unknown Errors with Proc Transpose

Trying to utilize proc transpose to a dataset of the form:
ID_Variable Target_Variable String_Variable_1 ... String_Variable_100
1 0 The End
2 0 Don't Stop
to the form:
ID_Variable Target_Variable String_Variable
1 0 The
. . .
. . .
1 0 End
2 0 Don't
. . .
. . .
2 0 Stop
However, when I run the code:
proc transpose data=input_data out=output_data;
by ID_Variable Target_Variable;
var String_Variable_1-String_Variable_100;
run;
The change in file size from input to output balloons from 33.6GB to over 14TB, and instead of the output described above, the result has many additional, completely null string variables (41 of them). There are no other columns on the input dataset, so I'm unsure why this happens. I already have a workaround using macros to create my own proxy transposing procedure, but any information about why the situation above occurs would be extremely appreciated.
In addition to the suggestion of compression (which is nearly always a good one when dealing with even medium sized datasets!), I'll make a suggestion for a simple solution without PROC TRANSPOSE, and hazard a few guesses as to what's going on.
First off, wide-to-narrow transpose is usually just as easy in a data step, and sometimes can be faster (not always). You don't need a macro to do it, unless you really like typing ampersands and percent signs, in which case feel free.
data want;
    set have;
    array transvars[*] string_Variable_1-string_Variable_100;
    do _t = 1 to dim(transvars);
        string_variable = transvars[_t];
        if not missing(string_variable) then output; *unless you want the missing ones;
    end;
    keep id_variable target_variable string_variable;
run;
Nice short code, and if you want you can throw in a call to vname to get the name of the transposed variable (or not). PROC TRANSPOSE is shorter, but this is short enough that I often just use it instead.
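For example, one extra line inside the loop captures the source column (the name source_column is just illustrative; remember to add it to the KEEP statement too):
source_column = vname(transvars[_t]); *name of the column this value came from;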
Second, my guess. 41 extra string variables tells me that you very likely have some duplicates by your BY group. If PROC TRANSPOSE sees duplicates, it will create that many columns. For EVERY ROW, since that's how columns work. It will look like they're empty, and who knows, maybe they are empty - but SAS still transposes empty things if it sees them.
To verify this, run a PROC SORT NODUPKEY on the BY variables before the transpose. If that doesn't delete at least 40 rows (maybe blank rows; if this data originated from Excel or something similar, I wouldn't be shocked to learn you had 41 blank rows at the end), I'll be surprised. If it doesn't fix it, and you don't like the data step solution, then you'll need to provide a reproducible example (i.e., provide some data that shows a similar expansion of variables).
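A minimal version of that check, assuming the same BY variables as your transpose:

proc sort data=input_data out=deduped nodupkey;
    by ID_Variable Target_Variable;
run;

Then compare the observation counts of deduped and input_data in the log.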
Without seeing a working example, it's hard to say exactly what's going on here with regards to the extra variables generated by proc transpose.
However, I can see three things that might be contributing towards the increased file size after transposing:
If you have options compress=no; in effect (the default), PROC TRANSPOSE creates an uncompressed dataset. Also, if some of your character variables have different lengths, they will all be transposed into one variable with the longest of those lengths, further increasing the file size when compression is disabled in the output dataset.
I suspect that some of the increase in file size may be coming from the automatic _NAME_ column generated by proc transpose, which contains an extra ~100 * max_var_name_length bytes for every ID-target combination in the input dataset.
If you are using options compress=binary; (i.e. compressing all output datasets that way by default), the SAS compression algorithm may be less effective after transposing. This is because SAS compresses one record at a time, and this type of compression is much less effective on shorter records. There isn't much you can do about this, unfortunately.
Here's an example of how you can avoid the first two of these potential issues.
/*Start with a compressed dataset*/
data have(compress = binary);
length String_variable_1 $ 10 String_variable_2 $20; /*These are transposed into 1 var with length 20*/
input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
cards;
1 0 The End
2 0 Don't Stop
;
run;
/*By default, proc transpose creates an uncompressed output dataset*/
proc transpose data = have out = want_default prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;
/*Transposing with compression enabled and without the _NAME_ column*/
proc transpose data = have out = want(drop = _NAME_ compress = binary) prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;

SAS: Calculating average (Euclidean) distance from each customer to all other customers in table

I have a SAS dataset of 60k customers with the following attributes:
1) customer number
2) X coordinate
3) Y coordinate
4) store visits
I need to calculate the average weighted distance from each customer to all the other customers in the table where each distance is weighted by the comparing customer's number of visits. As an example, the distance between Customer A & Customer B is 10. We would then weight that distance by Customer B's number of visits (2) which equals 5. This process would repeat for all other customers in the table and we would then average all of these weighted distances for each of the 60k customers.
I suppose the brute-force way to do this is a Cartesian join (i.e. create a 60K x 60K = 3.6 billion record table), but that will likely run out of memory or crash SAS. I have also thought of breaking this up into more manageable Cartesian joins (i.e. 6,000 iterations of 10 x 60K = 600K records each), but that is likely to be quite time consuming -- maybe my only choice though. I'm guessing you guys/gals know a much better way to do this!
I appreciate all your suggestions.
Thanks for your help!
Bad news, there is no way to speed up this calculation (that I know of).
Good news is SAS won't crash or run out of memory if you do a Cartesian product. Other good news is doing this in a data step is faster than doing it in PROC SQL.
data test;
    do cn = 1 to 64000;
        x = ceil(ranuni(13)*100);
        y = ceil(ranuni(13)*100);
        visits = max(1, round(rannor(12)*3 + 8, 1));
        output;
    end;
run;
sasfile test load;
data ave_dist(keep=cn ave_dist);
    set test;
    td = 0;
    total_visits = 0;
    do i = 1 to n;
        /*rename VISITS as well, so the weight comes from the comparing
          customer rather than the current one*/
        set test(rename=(cn=cn_2 x=x_2 y=y_2 visits=visits_2)) point=i nobs=n;
        if cn ^= cn_2 then do;
            xx = (x - x_2);
            yy = (y - y_2);
            total_visits = total_visits + visits_2;
            dist = sqrt(xx*xx + yy*yy);
            if dist ^= 0 then
                dist = 1/dist;
            else
                dist = 100; /*Adjust to something that makes sense to your data*/
            td = visits_2*dist + td;
        end;
    end;
    ave_dist = td / total_visits;
    output;
run;
sasfile test close;
I inverted the distance calculation. You want small distances to have a higher score. I made this a true visit weighted average.
This takes about 13 minutes to run on my laptop.
If your customer base is going to stay under ~100K, PROC DISTANCE could be of help. Using the dataset created by @DomPazz, you could run the following code and examine the results. In this case I'm only trying it out on the first 10K customers, which runs in 16 seconds. Do not let that lull you into a false sense of security: every time you double the number of customers, the time roughly quadruples, because the work is quadratic.
(actual times: 10K - 16 secs, 20K - 47 secs, 40K - 3 mins...)
This procedure produces an NxN square matrix (where N is the number of customers in your input dataset). You could experiment and see at what point SAS runs into RAM limits; be sure to have plenty of hard drive space, on the order of 1.10*N*N*8 bytes. For the full 60K customers that is about 60,000 squared times 8 bytes, roughly 28.8 GB, call it ~32 GB with the 10% overhead.
Each cell in the matrix represents the ith customer's (in rows) distance to the jth customer (in columns). Once you have the distances, it is a simple matter of multiplying each by the corresponding customer's visits and taking the average.
proc distance data=test(obs=10000)
              out=test_distances(compress=binary)
              method=EUCLID shape=SQUARE
              undef=1000000
              vardef=wdf;
    var interval(x y);
    copy cn visits;
run;

data avg_dist;
    set test_distances;
    array dist{*} dist:;
    prod = 0;
    do i = 1 to dim(dist);
        prod = visits*dist{i} + prod;
    end;
    avg_dist = prod/dim(dist);
    dims = dim(dist);
    drop i dist:;
run;
proc sql;
drop table test_distances;
quit;
The type of problem you are trying to solve is generally known as a k-nearest-neighbour problem. There have been decades of research in this area, and such problems are most often solved using special data structures such as k-d trees for performance. Usually one is interested in answering questions such as: who are the top 10 (or k) closest customers to this customer I'm interested in? Another procedure which is very good at solving these types of problems efficiently is PROC PMBR, which supports both k-d trees and SAS's proprietary structure called the Rd-tree; look it up, though you will only find a PDF document from the SAS Enterprise Miner 4.3 days.
The moment you have to calculate distances between N*N items, you are asking for trouble.
From reading your project description in the comments, it appears that what you need is not the distance between every customer and every other customer, but something like the distance between every customer and every store.
This will dramatically improve performance, since the dimensionality of the problem is greatly reduced.
Say you have N customers and S stores; then you only need to calculate distances between N*S points. A simple data step will do the job, as there is no need for a Cartesian product or specialised data structures; see the sketch below.
From there you can look at, for each store in S, what proportion of the customers who shopped at that store live within 1 km, 2 km, 3 km...
Then you can come up with answers such as: 80% live within 1 km, 15% live within 2 km, etc.
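A minimal sketch of that customer-by-store data step, assuming a CUSTOMERS dataset with CN, X, Y and a STORES dataset with STORE_ID, SX, SY (all names illustrative):

data cust_store_dist;
    set customers;                          *one row per customer: cn, x, y;
    do _s = 1 to nstores;
        set stores point=_s nobs=nstores;   *one row per store: store_id, sx, sy;
        dist = sqrt((x - sx)**2 + (y - sy)**2);
        output;                             *one output row per customer-store pair;
    end;
run;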

SAS: backward looking data step to compute the average

Sorry for the "not really informative" title of this post.
I have the following data set in SAS:
time Add time_delete
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
where time corresponds to a new added (Add) price in an auction every 3 minutes. This price can get deleted within the same time interval or later, as shown in time_delete. My objective is to compute the average price from the Add field standing at every time. For instance, my average price at time=5 is (3.15+3.11)/2, since the 3.00 gets deleted within the interval. The average price standing at time=8 is (4.20+3.15+3.11)/3. As you can see, I have to stand at the current time and look back to see which prices are still valid at that time. I would also like a field that gives, for every time, the highest price available that was not deleted.
Any help?
You have a variant of a rolling sum here. There's no one straightforward solution (especially as you undoubtedly have a few complications not mentioned); but here are a few pointers.
First, you may want to change the format of your data. This is actually a relatively easy problem to solve if you have one row for each possible timepoint rather than just a single row.
data have;
input time Add time_delete;
datalines;
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
;;;;
run;
data want;
    set have;
    if missing(time_delete) then time_delete = 12; *never deleted: carry it past the last time of interest (11 here);
    if time = time_delete then delete;
    else do time = time to time_delete - 1;
        output;
    end;
    keep time add;
run;
proc means data=want mean max n;
class time;
var add;
run;
You could output the PROC MEANS results to a dataset to get your maximum value plus the average value, and then merge that back onto the main dataset or do whatever else you need; a sketch follows.
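Something like this, using the WANT dataset from above (the output dataset and statistic names are my choice):

proc means data=want noprint nway;
    class time;
    var add;
    output out=standing(drop=_type_ _freq_) mean=avg_price max=max_price;
run;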
The main downside is that the expanded dataset is much larger, so if you're looking at hundreds of thousands of data points, this is likely not your best option.
You can also perform this in SQL without the extra rows, although this is where those "other complications" would potentially throw a wrench in things.
proc sql;
    select H.time,
           mean(V.add) as avg_add,
           max(V.add) as max_add
    from (select distinct time from have) H
    left join have V
        on V.time le H.time
        and (V.time_delete gt H.time or missing(V.time_delete)) /*missing = never deleted*/
    group by H.time;
quit;
Fairly straightforward and quick query, except that if you have a lot of time values it might take some time to execute the join.
Other options:
Read the data into an array, with a second array tracking the delete points. This can get a bit complex, as you probably need to sort your array by delete point; rather than just adding a new record at the end, you need to move a bunch of records down. SAS isn't quite as friendly to this sort of operation as a C-type language would be.
Use a hash table solution. Somewhat less messy than an array, particularly as you can sort a hash table more easily than two separate arrays.
Use IML and vectors. Similar to the array solution but with more powerful manipulation techniques available; see the sketch below.
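For what it's worth, a sketch of the IML route, using the HAVE dataset above and treating a missing time_delete as never deleted:

proc iml;
use have;
read all var {time add time_delete};
close have;

t_points = unique(time)`;                 /*distinct times, as a column vector*/
avg_price = j(nrow(t_points), 1, .);
max_price = j(nrow(t_points), 1, .);

do i = 1 to nrow(t_points);
    t = t_points[i];
    /*prices added on or before t and not yet deleted at t*/
    idx = loc(time <= t & (time_delete > t | time_delete = .));
    if ncol(idx) > 0 then do;
        sel = add[idx];
        avg_price[i] = sel[:];            /*[:] subscript reduction = mean*/
        max_price[i] = max(sel);
    end;
end;

print t_points avg_price max_price;
quit;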

SAS: How do I point to a specific observation of a value?

I'm very new to SAS and I'm trying to figure out some basic things available in other languages.
I have a table
ID Number
-- ------
1 2
2 5
3 6
4 1
I would like to create a new variable where I sum the value of one observation of Number to each other observations, like
Number2 = Number + Number[3]
ID Number Number2
-- ------ ------
1 2 8
2 5 11
3 6 12
4 1 7
How do I get the value of the third observation of Number and add it to each observation of Number in a new variable?
There are several ways to do this; here is one using the SAS POINT= option:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
run;
data want;
    retain adder;
    drop adder;
    if _n_ = 1 then do;
        adder = 3;
        set have point=adder;
        adder = number;
    end;
    set have;
    number = number + adder;
run;
The RETAIN and DROP statements define a temp variable to hold the value you want to add. RETAIN means the value is not to be re-initialized to missing each time through the data step and DROP means you do not want to include that variable in the output data set.
The POINT= option allows one to read a specific observation from a SAS data set. The _n_=1 part is a control mechanism to only execute that bit of code once, assigning the variable adder to the value of the third observation.
The next section reads the data set one observation at a time and applies your change.
Note that the same data set is read twice; a handy SAS feature.
I'll start by suggesting that Base SAS doesn't really work this way, normally; it's not that it can't, but normally you can solve most problems without pointing to a specific row.
So while this answer solves your explicit problem, it's probably not something you'd use in a real-world scenario; usually in the real world you'd have a match key or some other element besides 'row number' to combine on, and if you did, you could do it much more efficiently. You also could likely rearrange your data structure in a way that makes this operation more convenient. A sketch of the keyed version follows.
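For instance, if the value to add were keyed by ID in a small lookup table (the ADDERS table and its columns here are hypothetical), a join would be the idiomatic route:

proc sql;
    create table want as
    select h.*, h.number + a.adder as number2
    from have h
    left join adders a        /*hypothetical lookup table: ID, ADDER*/
        on h.id = a.id;
quit;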
That said, the specific example you give is trivial:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
;;;;
run;
data want;
    set have;
    _t = 3;
    set have(keep=number rename=(number=number3)) point=_t;
    number2 = number + number3;
    drop _t;
run;
If you have SAS/IML (SAS's matrix language), which is somewhat similar to R, then this is a very different story both in your likelihood to perform this operation and in how you'd do it.
proc iml;
    a = {1 2, 2 5, 3 6, 4 1}; /*create initial matrix*/
    b = a[,2] + a[3,2];       /*new matrix: the 2nd column of a added elementwise
                                to the value in the third row, second column*/
    c = a || b;               /*append new matrix to a - could be done in same step of course*/
    print b c;
quit;
To do this with the first observation, it's a lot easier.
data want;
set have;
retain _firstpoint; *prevents _firstpoint from being set to missing each iteration;
if _n_ = 1 then _firstpoint=number; *on the first iteration (usually first row) set to number's value;
number = number - _firstpoint; *now subtract that from number to get relative value;
run;
I'll elaborate a little more on this. SAS works on a record-by-record level, where each record is independently processed in the DATA step. (PROCs, on the other hand, may not behave this way, though many do at some level.) SAS, like SQL and similar databases, doesn't truly acknowledge that any row is "first" or "second" or "nth"; however, unlike SQL, it does let you pretend a row is nth, based on the current sort order. The POINT= random-access method is one way to go about doing that.
Most of the time, though, you're going to be using something in the data to determine what you want to do, rather than something related to the ordering of the data. Here's a way you could do the same thing as the POINT= method, but using the value of ID:
data want;
    if _n_ = 1 then set have(where=(ID=3) rename=(number=number3));
    set have;
    number2 = number + number3;
run;
That takes, on the first iteration of the data step (_N_=1), the row from HAVE where ID=3, and then reads the rows of HAVE in order (really it does this:)
*check to see if _n_=1; it is; so take row id=3;
*take first row (id=1);
*check to see if _n_=1; it is not;
*take second row (id=2);
... continue ...
Variables read with a SET statement are automatically retained, so NUMBER3 is automatically retained (yay!) and not set to missing between iterations of the data step loop. As long as you don't modify its value, it persists for every iteration.