SAS: in the DATA step, how to calculate the grand mean of a subset of observations, skipping over missing values - sas

I'm trying to calculate the grand mean of a subset of observations (e.g., observation 20 to observation 50) in the data step. In this calculation, I also want to skip over (ignore) any missing values.
I've tried to play around with the mean function using various if … then statements, but I can't seem to fit all of it together.
Any help would be much appreciated.
For reference, here's the basic outline of my data steps:
data sas1;
infile '[file path]';
input v1 $ 1-9 v2 $ 11 v3 13-17 [redacted] RPQ 50-53 [redacted] v23 101-106;
v1=translate(v1,"0"," ");
format [redacted];
label [redacted];
run;
data gmean;
set sas1;
id=_N_;
if id = 10-40 then do;
avg = mean(RPQ);
end;
/*Here, I am trying to calculate the grand mean of the RPQ variable*/
/*but only observations 10 to 40, and skipping over missing values*/
run;

Use the automatic variable /_N_/ to id the rows. Use a sum value that is retained row to row and then divide by the number of observations at the end. Use the missing() function to determine the number of observations present and whether or not to add to the running total.
data stocks;
set sashelp.stocks;
retain sum_total n_count 0;
if 10<=_n_<=40 and not missing(open) then do;
n_count=n_count+1;
sum_total=sum_total+open;
end;
if _n_ = 40 then average=sum_total/n_count;
run;
proc print data=stocks(obs=40 firstobs=40);
var average;
run;
*check with proc means that the value is correct;
proc means data=sashelp.stocks (firstobs=10 obs=40) mean;
var open;
run;

Related

Sum a number of specific rows before and after

I want to do a sum of 250 previous rows for each row, starting from the row 250th.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I try to use lag function, but it does not work for too many lags.
I also want to calculate sum of 250 next rows after each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
Creating a backwards moving sum of 250 observations, then setting the initial 250 to missing.
Reversing VWRET, creating a moving sum of 250 observations, setting the initial 250 to missing, then reversing it again. This effectively creates a forward moving sum.
The key is how to read observations from previous and post rows. As for your sum(n1, n2,...,nx) function, you can replace it with iterative summation.
This example uses multiple set skill to achieve summing a variable from 25 previous and post rows:
data test;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=air=pre_air) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=air=post_air) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;
Only 26th to nobs-25th rows will be calculated, where nobs stands for number of observations of the setting data sashelp.air.
Multiple set may take long time when meeting big dataset, if you want to be more effective, you can use array and DOW-loop to instead multiple set skill:
data test;
array _val_[1024]_temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is you have to give a dimension number to array, it should be equal or great than nobs.
These skills are from a concept called "Table Look-Up", For SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
You don't want use normal arithmetic with missing values becasue then the result is always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum but add the wrinkle of removing the last one in by subtraction. So your equation only needs to reference the one lagged value.
Here is a simple example using running sum of 5 using SASHELP.CLASS data as an example:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable sort the dataset in descending order and use the same logic to make another variable.

Missing values in VARMAX

I have a dataset with visitors and weather variables. I'm trying to forecast visitors based on the weather variables. Since the dataset only consists of visitors in season there is missing values and gaps for every year. When running proc reg in sas it's all okay but the issue comes when i'm using proc VARMAX. I cannot run the regression due to missing values. How can i tackle this?
proc varmax data=tivoli4 printall plots=forecast(all);
id obs interval=day;
model lvisitors = rain sunshine averagetemp
dfebruary dmarch dmay djune djuly daugust doctober dnovember ddecember
dwednesday dthursday dfriday dsaturday dsunday
d_24Dec2016 d_05Dec2013 d_24Dec2017 d_24Dec2014 d_24Dec2015 d_24Dec2019
d_24Dec2018 d_24Sep2012 d_06Jul2015
d_08feb2019 d_16oct2014 d_15oct2019 d_20oct2016 d_15oct2015 d_22sep2017 d_08jul2015
d_20Sep2019 d_08jul2016 d_16oct2013 d_01aug2012 d_18oct2012 d_23dec2012 d_30nov2013 d_20sep2014 d_17oct2012 d_17jun2014
dFrock2012 dFrock2013 dFrock2014 dFrock2015 dFrock2016 dFrock2017 dFrock2018 dFrock2019
dYear2015 dYear2016 dYear2017
/p=7 q=2 Method=ml dftest;
garch p=1 q=1 form=ccc OUTHT=CONDITIONAL;
restrict
ar(3,1,1)=0, ar(4,1,1)=0, ar(5,1,1)=0,
XL(0,1,13)=0, XL(0,1,14)=0, XL(0,1,13)=0, XL(0,1,27)=0, XL(0,1,38)=0, XL(0,1,42)=0;
output lead=10 out=forecast;
run;
As with any forecast, you will first need to prepare your time-series. You should first run through your data through PROC TIMESERIES to fill-in or impute missing values. The impute choice that is most appropriate is dependent on your variables. The below code will:
Sum lvisitors by day and set missing values to 0
Set missing values of averagetemp to average
Set missing values of rain, sunshine, and your variables starting with d to 0 (assuming these are indicators)
Code:
proc timeseries data=have out=want;
id obs interval = day
setmissing = 0
notsorted
;
var lvisitors / accumulate=total;
crossvar averagetemp / accumulate=none setmissing=average;
crossvar rain sunshine d: / accumulate=none;
run;
Important Time Interval Consideration
Depending on your data, this could bias your error rate and estimates since you always know no one will be around in the off-season. If you have many missing values for off-season data, you will want to remove those rows.
Since PROC VARMAX does not support custom time intervals, you can instead create a simple time identifier. You can alternatively turn this into a format for proc format and converttime_id at the end.
data want;
set have;
time_id+1;
run;
proc varmax data=want;
id time_id interval=day;
...
output lead=10 out=myforecast;
run;
data myforecast;
merge myforecast
want(keep=time_id date)
;
by time_id;
run;
Or, if you made a format:
data myforecast;
set myforecast;
date = put(time_id, timeid.);
drop time_id;
run;

How can I create pivot table in SAS?

I have three columns in a dataset: spend, age_bucket, and multiplier. The data looks something like...
spend age_bucket multiplier
10 18-24 2x
120 18-24 2x
1 35-54 3x
I'd like a dataset with the columns as the age buckets, the rows as the multipliers, and the entries as the sum (or other aggregate function) of the spend column. Is there a proc to do this? Can I accomplish it easily using proc SQL?
There are a few ways to do this.
data have;
input spend age_bucket $ multiplier $;
datalines;
10 18-24 2x
120 18-24 2x
1 35-54 3x
10 35-54 2x
;
proc summary data=have;
var spend;
class age_bucket multiplier;
output out=temp sum=;
run;
First you can use PROC SUMMARY to calculate the aggregation, sum in this case, for the variable in question. The CLASS statement gives you things to sum by. This will calculate the N-Way sums and the output data set will contain them all. Run the code and look at data set temp.
Next you can use PROC TRANSPOSE to pivot the table. We need to use a BY statement so a PROC SORT is necessary. I also filter to the aggregations you care about.
proc sort data=temp(where=(_type_=3));
by multiplier;
run;
proc transpose data=temp out=want(drop=_name_);
by multiplier;
var spend;
id age_bucket;
idlabel age_bucket;
run;
In traditional mode 35-54 is not a valid SAS variable name. SAS will convert your columns to proper names. The label on the variable will retain the original value. Just be aware if you need to reference the variable later, the name has changed to be valid.

Moving average only for consecutive days in sas

I have a question regarding moving average. I use Proc Expand (cmovave 3), but those three days can be non consecutive I suppose. I want to avoid missing data between days and use moving average for just those adjacent days.
Is there any way that I can do this? If I want to put it in another way 'how can I select a part of my data set where I have values for consecutive period (days)?'. I hope you give me some examples for this problem.
Use Expand to make sure you have all the values in the timeseries interval. Then use a data step to calculate the ma3 with the lagN() functions.
If you data already has the correct timeseries interval, then skip the PROC EXPAND step.
data test;
start = "01JAN2013"d;
format date date9.
value best.;
do i=1 to 365;
r = ranuni(1);
value = rannor(1);
date = intnx('weekday',start,i);
dummy=1;
if r > .33 then output;
end;
drop i start r;
run;
proc expand data=test out=test2 to=weekday ;
id date;
var dummy;
run;
data test(drop=dummy);
merge test2 test;
by date;
ma3 = (value + lag(value) + lag2(value))/3;
run;
I use the DUMMY variable so that EXPAND will convert the series to WEEKDAY. Then drop it afterwards.

SAS-How to format arrays dynamically based on information in one column

I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category Name1 Name2......... (Changes invariably)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to format the values under Name1 to infinite Name# and reformat them to dollar10.2 for any values under Category called 'AmountBilled','AmountPaid','AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but it loses the ability to do math.
data have;
input category :$10. name1 name2;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically figure out the newname1000 endpoint using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternately, you could put out the original dataset as a three column dataset with many rows for each category (was it this way to begin with prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names).
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.