SAS: backward looking data step to compute the average

Sorry for the "not really informative" title of this post.
I have the following data set in SAS:
time Add time_delete
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
Here time corresponds to a newly added (Add) price in an auction at every 3-minute interval. This price can get deleted within the same time interval or later, as shown in time_delete. My objective is to compute the average price from the Add field standing at every time. For instance, my average price at time=5 is (3.15+3.11)/2 since the 3.00 gets deleted within the interval. Then the average price standing at time=8 is (4.20+3.15+3.11)/3. As you can see, I have to stand at the current time and look back to see which prices are still valid at that time. Also, I would like a field giving, for every time, the highest price available that was not deleted.
Any help?

You have a variant of a rolling sum here. There's no one straightforward solution (especially as you undoubtedly have a few complications not mentioned), but here are a few pointers.
First, you may want to change the format of your data. This is actually a relatively easy problem to solve if you have one row for each possible timepoint rather than just a single row.
data have;
input time Add time_delete;
datalines;
5 3.00 5
5 3.15 11
5 3.11 11
8 4.21 8
8 3.42 8
8 4.20 11
11 3.12 .
;;;;
run;
data want;
    set have;
    if time = time_delete then delete;             *deleted within its own interval - never counts;
    else if missing(time_delete) then output;      *never deleted - keep it standing at its own time point (extend if later times exist);
    else do time = time to time_delete - 1;
        output;
    end;
    keep time add;
run;
proc means data=want mean max n;
    class time;
    var add;
run;
You could output the proc means to a dataset and have your maximum value plus the average value, and then either put that back on the main dataset or whatever you need.
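For instance, a minimal sketch of that step - the output dataset and variable names (standing, avg_add, max_add) are just placeholders:
proc means data=want noprint nway;
    class time;
    var add;
    output out=standing (drop=_type_ _freq_) mean=avg_add max=max_add;
run;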
The main downside to the one-row-per-timepoint approach is that it creates a much larger dataset, so if you're looking at hundreds of thousands of data points, this is likely not your best option.
You can also perform this in SQL without the extra rows, although this is where those "other complications" would potentially throw a wrench in things.
proc sql;
    select H.time,
           mean(V.add) as avg_add,
           max(V.add)  as max_add
    from (select distinct time from have) H
    left join have V
        on  V.time le H.time
        and (V.time_delete gt H.time or missing(V.time_delete))  /* a missing time_delete means the price was never deleted */
    group by H.time;
quit;
Fairly straightforward and quick query, except that if you have a lot of time values it might take some time to execute the join.
Other options:
Read the data into an array, with a second array tracking the delete points. This can get a bit complex as you probably need to sort your array by delete point - so rather than just adding a new record into the end, you need to move a bunch of records down. SAS isn't quite as friendly to this sort of operation as a c-type language would be.
Use a hash table solution. Somewhat less messy than an array, particularly as you can sort a hash table more easily than two separate arrays.
Use IML and vectors. Similar to the array solution but with more powerful manipulation techniques available.

Related

SAS: Adding aggregated data to same dataset

I'm migrating from SPSS to SAS.
I need to compute the sum of variable varX, separately by groups of variables varA varB, and add it as a new variable sumX to the same dataset.
In SPSS this is implemented easily with aggregate:
aggregate outfile *
/break varA varB
/SUMvarX = sum(varX).
can this be done in SAS?
There are a number of ways to do this, but the best way depends on your data.
For a typical use case, the PROC MEANS solution is what I'd recommend. It's not the fastest, but it gets the job done, and it leaves a lot less room for error - you're not really doing anything except match-merging afterwards.
Use the class statement instead of by in most cases; it shouldn't make much of a difference, but grouping is what class is for. by runs the analysis separately for each value of those variables; class runs one analysis grouped by all of those variables. It is more flexible and doesn't require a sorted dataset (though you would have to sort anyway for the later merge). class also lets you do multiple combinations - not just the nway combination you ask for here, but if you want it grouped just by a, just by b, and by a*b, you can get that (with class and types; there's a sketch of that after the example below).
proc means data=have noprint nway;
    class a b;
    var x;
    output out=summary (drop=_type_ _freq_) sum(x)=sum_x;  *nway keeps just the a*b combination for the merge;
run;

data want;
    merge have summary;
    by a b;
run;
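For the multiple-combination case mentioned above, a quick sketch with the types statement (summary_all and sum_x are just illustrative names):
proc means data=have noprint;
    class a b;
    types a b a*b;              *one set of summary rows per requested combination;
    var x;
    output out=summary_all sum(x)=sum_x;
run;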
The DoW loop covered in Kermit's answer is a reasonable data step option as well, though riskier in terms of programmer error; I'd use it only in particular cases where the dataset is very, very large - larger than fits in memory even once summarized - and performance is important.
If the data fits in memory, you can also use a hash table to do the summary, and that's what I'd do if the summary dataset fit comfortably in memory. This is too long for an answer here, but Data Aggregation using Hash Object is a good start for how to do that. Basically, you use a hash table to store the results of the summary (not the raw data), adding to it with each row, and then output the hash table at the end. A bit faster than the DoW loop, but slightly memory constrained (although if you used SPSS, you're much more memory constrained than this!). Also very easy to handle multiple combinations.
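As a rough illustration of that approach - a minimal sketch only, assuming a dataset have with class variables a and b and analysis variable x (the names h, sum_x, and summary are made up for the example):
data _null_;
    length sum_x 8;
    if _n_ = 1 then do;
        declare hash h(ordered:'a');               *one entry per a*b combination;
        h.defineKey('a', 'b');
        h.defineData('a', 'b', 'sum_x');
        h.defineDone();
        call missing(sum_x);
    end;
    set have end=done;
    if h.find() ne 0 then sum_x = 0;               *first time we see this group;
    sum_x = sum(sum_x, x);
    h.replace();                                   *store the running sum back in the table;
    if done then h.output(dataset:'summary');      *write one row per group at the end;
run;
Only one row per a*b group ever sits in the hash, so memory use scales with the number of groups rather than the number of rows.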
Another "programmer easy" way to do it is with SQL.
proc sql;
    create table want as
    select *, sum(x) as sum_x
    from have
    group by a, b
    ;
quit;
This is not standard SQL, but SAS manages it - basically it does the two-step process of the proc means and the merge in one step. I like this in some ways (because it skips the intermediate dataset - even though it does actually create that dataset in the util folder, it just cleans it up for you automatically) and dislike it in others (it's not standard SQL, so it will confuse people, and it leaves a note in the log - only a note, so not a big deal, but still).
Adding a note about SPSS -> SAS thinking. One of the bigger differences you'll see going from SPSS to SAS is that, in SPSS, you have one dataset, and you do stuff to it (mostly). You could save it as a different dataset, but you mostly don't until the end - all of your work really is just editing one dataset, in memory.
In SAS, you read datasets from disk and do stuff and then write them out, and if you're doing anything that is at the dataset level (like a summary), you mostly will do it separately and then recombine with the data in a later step. As such, it's very, very common to have lots of datasets - a program I just ran probably has a thousand. Not kidding! Don't worry about random temporary datasets being produced - it doesn't mean your code is not efficient. It's just how SAS works. There are times where you do have to be careful about it - like you have 150GB datasets or something - but if you're working with 5000 rows with 150 variables, your dataset is so small you could write it a thousand times without noticing a meaningful difference to your code execution time.
The big benefit to this style is that you have different datasets for each step, so if you go back and want to rerun part of your code, you can safely - knowing the predecessor dataset still exists, without having to rerun all of your code. It also lets you debug really easily since you can see each of the component parts.
It's a tradeoff for sure, because it does mean it takes a little longer to run the code, but modern CPUs are really, really fast, and so are SSDs - it's just not necessary to write code that stays all in one data step or runs entirely in memory. In exchange, you get the ability to do crazy large amounts of things that couldn't possibly fit in memory, work with massive datasets, etc. - constrained only by disk, which is usually in far greater supply. It's a tradeoff worth making in many cases. When it's possible to do something in a PROC, do so, even when that means it costs a tiny bit of time at the end to re-merge it - the PROCs are what you're paying SAS the big bucks for; they are easy to use, well tested, and fast at what they do.
OK, I think I found a way of doing that.
First, you produce the summary variables:
proc means data=<dataset> noprint nway;
    by varA varB;
    var varX;
    output out=<TEMPdataset> sum=SUMvarX;
run;
then you merge the two datasets:
DATA <dataset>;
    MERGE <TEMPdataset> <dataset>;
    BY varA varB;
run;
This seems to work, although an extra dataset and several extra variables are formed in the process.
There are probably more efficient ways of doing it...
Ever heard of DoW Loop?
*-- Create synthetic data --*;
data have;
    varA=2; varB=4; varX=21; output;
    varA=4; varB=6; varX=32; output;
    varA=5; varB=8; varX=83; output;
    varA=4; varB=3; varX=78; output;
    varA=4; varB=8; varX=72; output;
    varA=2; varB=4; varX=72; output;
run;

proc sort data=have; by varA varB; run;
varA varB varX
2 4 21
2 4 72
4 3 78
4 6 32
4 8 72
5 8 83
data stage1;
    set have;
    by varA varB;
    if first.varB then group_number+1;   *number each varA*varB group;
run;

data want;
    *first pass through the group - accumulate the sum;
    do _n_=1 by 1 until (last.group_number);
        set stage1;
        by group_number;
        SUMvarX=sum(SUMvarX, varX);
    end;
    *second pass through the same group - re-read it and output each row with the total;
    do until (last.group_number);
        set stage1;
        by group_number;
        output;
    end;
    drop group_number;
run;
varA varB varX SUMvarX
2 4 21 93
2 4 72 93
4 3 78 78
4 6 32 32
4 8 72 72
5 8 83 83

Compute growth rate, improvements over PROC EXPAND

I have a SAS dataset, sorted, which has two columns: PERIOD and MYMETRIC
For each row, I want to compute the growth rate of the 4 periods preceding, by using a linear regression. So the formula is basically
GROWTH RATE = Cov([MYMETRIC_lag_4,MYMETRIC_lag_3, MYMETRIC_lag_2, MYMETRIC_lag_1],[1,2,3,4])/Var([1,2,3,4])
I can do this in SAS through a proc expand to compute the lags, then a data step to compute the growth rate. I was wondering if there was a shorter way to do this? Especially if suddenly, I choose to include 8 points and not 4, I want to minimize the rework.
You can do this entirely in a data step. This assumes you're asking for the four previous rows. I'm not sure what [1,2,3,4] means, though, so you'll have to fill in exactly what that means in the growth rate.
%let numlags=4;
data want;
    set have;
    array lags[&numlags] _temporary_;   *temporary arrays are retained across rows!;
    growth_rate = .;                    *placeholder - compute Cov(lags, index)/Var(index) here (see the sketch below);
    *shift the lag window down and append the current value;
    do _t = 1 to dim(lags)-1;
        lags[_t] = lags[_t+1];
    end;
    lags[dim(lags)] = MYMETRIC;
    drop _t;
run;
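If [1,2,3,4] is simply the time index 1 through &numlags, then Cov(lags, index)/Var(index) is the OLS slope of the lagged values against that index, and the placeholder can be filled in explicitly. This is only a sketch of that idea - the helper variables (_xbar, _ybar, _sxy, _sxx) are mine, not part of the original answer:
%let numlags=4;
data want;
    set have;
    array lags[&numlags] _temporary_;               *temporary arrays are retained across rows;
    *growth rate = OLS slope of the previous lags against the index 1 to dim(lags);
    if n(of lags[*]) = dim(lags) then do;           *only once the lag window is full;
        _xbar = (&numlags + 1) / 2;
        _ybar = mean(of lags[*]);
        _sxy = 0;
        _sxx = 0;
        do _t = 1 to dim(lags);
            _sxy = _sxy + (_t - _xbar) * (lags[_t] - _ybar);
            _sxx = _sxx + (_t - _xbar)**2;
        end;
        growth_rate = _sxy / _sxx;                  *Cov(lags, index) / Var(index);
    end;
    *shift the window down and append the current value;
    do _t = 1 to dim(lags) - 1;
        lags[_t] = lags[_t+1];
    end;
    lags[dim(lags)] = MYMETRIC;
    drop _:;
run;
Changing %let numlags to 8 changes the window without any other rework, which was the original goal.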

SAS sequential regression (in Quandt's log likelihood method)

I am coding in SAS Enterprise Guide 4.2.
I am trying to calculate the Quandt's log likelihood ratio. But it is not important to understand that to understand my question.
The ratio is based on sequential regressions.
Namely regressions from 1 to t0, where 1 <= t0 <= T and T is the sample size.
Illustration:
First perform regression on the first observation
Then perform regression on the first two observations
Then perform regression on the first 3 observations
...and so on
It is also performing a "forward regression" from t0+1 to T.
Illustration:
First perform regression on the last T-1 observations
Then perform regression on the last T-2 observations
Then perform regression on the last T-3 observations
...and so on
The regression is an Ordinary Least Squares regression.
After the regression is performed, the square of the residuals are summed.
So this is what I need.
For each observation t0 I want to:
do an OLS regression from 1 to t0 and sum up the square of the residuals
do an OLS regression from t0+1 to T and sum up the square of the residuals
The data consists of one group variable, one dependent variable and one independent variable.
The calculations should be performed grouped by the group variable (but that shouldn't be too difficult).
I have been able to do part of this task myself, but it is horribly inefficient, and since the data consists of over 1,000,000,000 observations, efficiency is very important.
I have also noticed that the procedure "autoreg" calculates the CUSUM statistic, which is also based on sequential regression, so I suspect that this functionality could be available in SAS, but I haven't been able to find it.
And the part I am struggling with most right now is the summation.
Simple example of the summation I want to do:
Input:
col1 col2
1 2
2 5
5 4
7 6
Output:
col3
2 =1*2
15 =1*5+2*5
32 =1*4+2*4+5*4
90 =1*6+2*6+5*6+7*6
Has anyone encountered a similar problem or have any idea on how to solve it in an efficient way?
All help is welcome and feel free to ask me to clarify something if it is unclear.
As far as the summation goes, the below should work (though your input dataset must be sorted by group first).
Since the summation you're asking for is basically col2 multiplied by the cumulative sum of col1 within each group, you can use a retain statement to keep track of the sum of col1, and by-group processing to reset the cumulative sum each time the data step encounters a new group.
data output;
    retain cusum;
    set input;
    by group;
    if first.group then cusum = col1;
    else cusum = cusum + col1;
    col3 = cusum * col2;
    drop cusum;
run;
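To check it against the example in the question (a constant group variable is added here only to satisfy the BY statement):
data input;
    input col1 col2;
    group = 1;          *single group, just for illustration;
datalines;
1 2
2 5
5 4
7 6
;
run;
Running the step above on this input gives col3 = 2, 15, 32, and 90, matching the expected output.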

SAS macros to average between a range of dates with missing dates in the data

I'm completely new to SAS and its macros. I have this dataset, named mydata:
Obs SYMBOL DATE kx y
1 A 20120128 5 6
2 B 20120128 10 7
3 C 20120128 20 9
4 D 20120128 6 10
5 E 20120128 9 20
My problem is to find this function:
New_i = ∑_{j ∈ [-10,-2]} (x+y)_{i,j} / N,
where,
i = any random date(user defined)
-10 and -2 (10 days or 2 days before i)
N= total number of days with data available for (x+y) between (-10,-2)
There can be missing dates in the available data.
Can anyone help me with possible SAS macros for this problem?
Thanks in Advance!!
I'm assuming your date data are stored as dates and can accept numeric calculations. I'm also assuming that you want to get average of X and Y for a particular date around d, where d is user defined. Last, I'm assuming that if you have two unique ids on the same day, you keep the first one at random. Obviously those assumptions might need to be tweaked a bit but, from what I believe you are asking (I confess I'm only mostly sure I understand your question), hopefully this is close enough to what you need that you can tweak the rest pretty easily.
Okay...
PROC SORT DATA = in;
    BY date uniqueid;
RUN;

%MACRO summarize( userdate );  /* pass the date as yyyymmdd, e.g. 20120128 */
    %LET d = %SYSFUNC(inputn(&userdate, yymmdd8.));
    DATA out;
        SET in (WHERE = (date >= &d - 10 AND date <= &d - 2));
        BY date uniqueid;
        xy = sum(x, y);
        IF first.uniqueid;
    RUN;
    PROC SUMMARY DATA = out;
        VAR xy;
        OUTPUT OUT = Averages&userdate MEAN(xy) = ;
    RUN;
%MEND summarize;

%summarize(20120128);
What's going on here? Well, I sort the data first by date and uniqueid. I could use NODUPKEY, but I imagine you might want to control how duplicate uniqueids on a given date are handled. The data step is throwing out the dups by keeping the first one it comes across, but you could modify the deduping logic (which comes from the BY statement in the DATA step and the IF first. statement in the same step).
You want a set of dates around a particular user-defined date, d. So get d and filter the dataset with WHERE. You could also do this in your PROC SORT step, and there might be reasons for doing so if your raw data will be updated frequently. If you don't need to run the sort every time a user defines a date range, keep it outside the macro and only run it when needed. Sorts can be slow.
In the data step, I'm getting sum(x,y) to account for the fact that either x or y might be missing, or both, or neither. x + y would return missing in those cases. I assume that's not what you want, but do keep in mind that we'll be averaging sum(x,y) over N, where N is "either x or y is not missing." If you wanted to ignore those rows entirely, use x + y and add IF xy ne . in your DATA step.
The last part, the PROC SUMMARY step, should be pretty self-explanatory.
Hope this helps.

New SAS variable conditional on observations

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
proc sort data=yourdata;
    by site descending date;
run;

data yourdata_want;
    set yourdata;
    by site descending date;
    if first.site then do;
        comp = ifn(date > 0, 1, 0);
        output;
    end;
run;

proc freq data=yourdata_want;
    tables comp;
run;
If you used NODUPKEY, you'd first sort it by SITE and DESCENDING DATE, then by SITE with NODUPKEY. That way the latest date is up top. You also could format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
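A rough sketch of that format-based route, assuming DATE is a numeric SAS date (the format name compfmt and the intermediate dataset names are just illustrative):
proc format;
    value compfmt
        .     = 'Last Month'
        other = 'Complete';
run;

proc sort data=yourdata out=srt;
    by site descending date;
run;

proc sort data=srt out=onepersite nodupkey;
    by site;
run;

proc freq data=onepersite;
    tables date / missing;
    format date compfmt.;
run;
The missing option keeps the missing-date sites in the table, so they show up under 'Last Month'.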
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).