SAS: Calculate standard deviation based on rolling window

I need to obtain the standard deviation over a window of (-2, +2) days around each date, rather than (-2, +2) observations. My table is below:
date value standard_deviation
01/01/2015 18 ...
01/01/2015 15 ...
01/01/2015 5 ...
02/01/2015 66 ...
02/01/2015 7 ...
03/01/2015 7 ...
04/01/2015 19 ...
04/01/2015 7 ...
04/01/2015 11 ...
04/01/2015 17 ...
05/01/2015 3 ...
06/01/2015 7 ...
... ... ...
The tricky part is that each day has a different number of observations. Therefore, I cannot simply use the following code:
PROC EXPAND DATA=TESTTEST OUT=MOVINGAVERAGE;
CONVERT VAL=AVG / TRANSFORMOUT=(MOVSTD 5);
RUN;
Can anyone tell me how to compute a centered moving standard deviation that includes only (-2, +2) days in this case?
Thanks in advance!
Best

Sorry for the confusion. Here's how I'd do it using proc sql. Simply use a subquery with the std aggregate function and a where clause selecting the observations you want:
proc sql;
select
h.date,
h.value,
( select std(s.value)
from have s
where h.date between s.date-2 and s.date+2) as stddev
from
have h
;
quit;
If the period should refer to only the days existing in the table (as opposed to all calendar days), then create a variable counting the dates and use that instead of the date variable to select your observations.
data have2;
set have;
by date;
if first.date then sort+1;
run;
proc sql;
select
h.date,
h.value,
( select std(s.value)
from have2 s
where h.sort between s.sort-2 and s.sort+2) as stddev
from
have2 h
;
quit;
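To try either query, the sample rows from the question can be loaded into a dataset named have. This is only a sketch: it assumes the dates are written day/month/year (hence the ddmmyy10. informat, which is a guess based on the sample) and it uses just the twelve rows shown in the question.

```sas
/* Sample data from the question; the ddmmyy10. informat is an assumption */
data have;
  input date :ddmmyy10. value;
  format date ddmmyy10.;
  datalines;
01/01/2015 18
01/01/2015 15
01/01/2015 5
02/01/2015 66
02/01/2015 7
03/01/2015 7
04/01/2015 19
04/01/2015 7
04/01/2015 11
04/01/2015 17
05/01/2015 3
06/01/2015 7
;
run;
```

Because date is stored as a numeric SAS date, expressions such as s.date-2 and s.date+2 in the subquery count calendar days.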

Related

Proc freq for values comparison

I have the following table
ID POINTS1 POINTS2 time Table1 Table2
12 10 9 2011 A B
13 12 8 2010 A B
22 9 11 2011 A C
24 8 8 2012 A C
I would need to visually compare these results focusing on the change of points of IDs between the two tables.
I am thinking about a correlation matrix or transition matrix where I have on x points1 and on y points2, possibly differently coloured based on the following conditions
A -> B
A -> C
I know that transition matrices can be created by frequency tables in SAS.
Proc freq data=my_table;
Tables points1*points2;
Run;
But I am wondering how to include information on the transition between table1 and table2, using colours in a two-way table with proc freq. I know that it could be possible using proc tabulate.
I have found a way using proc format and tabulate, instead of proc freq.
proc format;
value cell_col low-50="green"
50-high="red";
run;
proc tabulate data=my_table s=[foreground=cell_col.];
class points1 points2 table2;
table points1 points2 table2;
run;

How can I compare many datasets and update several columns based on the max value of a single column in SAS?

I have test scores from many students in 8 different years. I want to retain only the max total score of each student, but then also retain all the student-year related information to that test score (that is, all the columns from the same year in which the student got the highest total score).
An example of the datasets I have:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
output;
end;
run;
%end;
%mend;
%score;
In my expected output, I would like to retain the max of total_score for each student, and also have the other columns related to that total score. If possible, I would also like to have the information about the year in which the student got the max of total_score. An example of the expected output would be:
DATA want;
INPUT id total_score english math sciences history year;
CARDS;
1 75.4 15.4 20 20 20 2017
2 63.8 20 13.8 10 20 2016
3 48 10 10 18 10 2018
4 52 12 10 10 20 2016
5 69.5 20 19.5 20 10 2013
6 85 20.5 20.5 21 23 2011
7 41 5 12 14 10 2010
8 55.3 15 20.3 10 10 2012
9 51.5 10 20 10 11.5 2013
10 48.9 12.9 16 10 10 2015
;
RUN;
I have been trying to work with the SAS UPDATE statement, but it just keeps the most recent value for each student. I want the max total score. Also, UPDATE only lets me combine two tables at a time, and I would like to compare all tables at the same time. So the strategy I am trying does not work:
data want;
update student_2010 student_2011;
by id;
run;
Thanks to anyone who can provide insights.
It is easier to obtain what you want if you have only one longitudinal dataset with all the original information of your students. It also makes more sense, since you are comparing students across different years.
To build a longitudinal dataset, you will first need to add a variable recording the year to each of your original datasets. For example:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
year=&year.;
output;
end;
run;
%end;
%mend;
%score;
After including the year, you can get a longitudinal dataset with:
data student_allyears;
set student_201:;
run;
Finally, you can get what you want with a proc sql, in which you select the max of "total_score" grouped by "id":
proc sql;
create table want as
select distinct *
from student_allyears
group by id
having total_score=max(total_score);
quit;
Create a view that stacks the individual data sets and perform your processing on that.
Example (SQL select, group by, and having)
data scores / view=scores;
length year $4;
set work.student_2010-work.student_2018 indsname=dsname;
year = scan(dsname,-1,'_');
run;
proc sql;
create table want as
select * from scores
group by id
having total_score=max(total_score)
;
quit;
Example DOW loop processing
Stack the data so the view can be processed BY ID. The first DOW loop computes which record has the max total score within the group, and the second selects that record in the group for OUTPUT.
data scores_by_id / view=scores_by_id;
set work.student_2010-work.student_2018 indsname=dsname;
by id;
year = scan(dsname,-1,'_');
run;
data want;
* compute which record in group has max measure;
do _n_ = 1 by 1 until (last.id);
set scores_by_id;
by id;
if total_score > _max then do;
_max = total_score;
_max_at_n = _n_;
end;
end;
* output entire record having the max measure;
do _n_ = 1 to _n_;
set scores_by_id;
if _n_ = _max_at_n then OUTPUT;
end;
drop _max:;
run;
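A sort-based alternative, sketched under the assumption that the stacked dataset student_allyears from the earlier answer exists and already carries the year variable. The intermediate dataset name sorted_scores is made up for this example.

```sas
/* Put the highest total_score first within each id */
proc sort data=student_allyears out=sorted_scores;
  by id descending total_score;
run;

/* Keep only the first row per id, i.e. the year with the max total_score */
data want;
  set sorted_scores;
  by id;
  if first.id;
run;
```

Unlike the HAVING query, this keeps exactly one row per student even when two years tie for the maximum.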

Numeric Macro in SAS

I have a variable for counting days. I'm trying to use the day count to divide by total days.
How do I create a macro that stores the most recent day and allows me to quote it later?
This is what I have so far (I've cut out code that's not relevant)
DATA scotland;
input day deathsscotland casesscotland;
cards;
1 1 85
2 1 121
3 1 153
4 1 171
5 2 195
6 3 227
7 6 266
8 6 322
9 7 373
10 10 416
11 14 499
12 16 584
13 22 719
14 25 894
;
run;
proc sort data=scotland out=scotlandsort;
by day;
run;
Data _null_;
keep day;
set scotlandsort end=eof;
if eof then output;
run;
%let daycountscot = day
Data ratio;
set cdratio;
SCOTLANDAVERAGE = (SCOTLANDRATIO/&daycountscot)*1000;
run;
Using your own code, you can create the macro variable like this:
Data _null_;
keep day;
set scotlandsort end=eof;
if eof then call symputx('daycountscot', day);
run;
%put &daycountscot.;
The data _null_ is not doing anything. You can eliminate the sort and data steps by selecting the max day value directly into a macro variable.
proc sql noprint;
select max(day) into :daycountscot trimmed
from scotland
;
quit;
There is no need to use macro code for this; it is better to keep values in variables anyway. To store the value as a macro variable, SAS has to convert the number into text, which means rounding it.
You could make a dataset with the maximum DAY value and then combine it with the dataset where you want to do the division.
data last_day;
set scotlandsort end=eof;
if eof then output;
keep day;
rename day=last_day;
run;
data ratio;
set cdratio;
if _n_=1 then set last_day;
SCOTLANDAVERAGE = (SCOTLANDRATIO/last_day)*1000;
run;
Probably easier in SQL code:
proc sql;
create table ratio as
select a.*, (SCOTLANDRATIO/last_day)*1000 as SCOTLANDAVERAGE
from cdratio a
, (select max(day) as last_day from scotland)
;
quit;

modify a dataset to extend time range

Sorry for the confusing title.
Background
data looks like this
Area Date Ind LB UB
A 1mar 14 1 20
A 2mar 3 1 20
B 1mar 11 7 22
B 2mar 0 7 22
Area has several distinct values. For each area, LB and UB are fixed across multiple dates, while Ind varies. Date always runs from the start of the month to some day of the month.
Target
My target is to run a control chart for each area to see if Ind exceeds the range (LB,UB).
But if I just plot the raw data for each area, the x-axis by default does not end at the last day of the month. (In the previous example, the plot runs from 1-Mar to 2-Mar instead of to 31-Mar.) I do know that specifying the xmax option in xaxis extends the plot to 31-Mar. But this only extends the x-axis; LB and UB are still drawn from 1-Mar to 2-Mar, leaving the right side of the graph empty.
Thus I use modify to add some date records.
What I have done
data have;
modify have;
do i = 0 to intck('day',today(),intnx('month',today(),0,'E'));
Date = today()+i;
call missing(Ind);
output;
end;
stop;
run;
proc sgplot data=have missing;
series ... Ind ...;
series ... LB ...;
series ... UB ...;
run;
Question
But this only works for one area. I would need to modify each area first and then plot them one by one. How can I get the data below in a relatively efficient way?
Area Date Ind LB UB
A 1mar 14 1 20
A 2mar 3 1 20
A 3mar . 1 20
....
A 31mar . 1 20
B 1mar 11 7 22
B 2mar 0 7 22
B 3mar . 7 22
....
B 31mar . 7 22
Or are there other options in proc sgplot to solve this?
You can use proc timeseries with the by-group area to get it into the form that you need. The end= option will let you specify an ending date for your data. It looks like you're using the current month, so we'll take your intnx function and plop it into a set of macro functions that resolve to a date literal (most ETS procs require a date literal for some reason).
We'll use two var statements: one for ind where we fill in unobserved values with ., and another for LB & UB to set their unobserved values with the previous valid value.
Note that we are assuming you've already put date into a SAS date. Make sure you do this first before running the below code.
proc timeseries data=have
out=want;
by area;
id Date interval=day notsorted
accumulate=none
end="%sysfunc(intnx(month, %sysfunc(today() ), 0, E), date9.)"d;
var Ind / setmissing=missing;
var LB UB / setmissing=previous;
run;
Your final dataset will look exactly as you'd like.
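If SAS/ETS is not available, a plain data step can produce a similar result. This is a sketch with its own assumptions: have is sorted by Area and Date, Date is a numeric SAS date, and the month to fill is the month of each area's last observed date rather than the current month. The variable month_end is a helper introduced for this example.

```sas
data want;
  set have;
  by area;
  output;                              /* keep every observed row */
  if last.area then do;
    /* month_end is a hypothetical helper: last day of the observed month */
    month_end = intnx('month', date, 0, 'E');
    call missing(ind);                 /* padded rows have no Ind value */
    do date = date + 1 to month_end;   /* LB and UB carry over unchanged */
      output;
    end;
  end;
  drop month_end;
run;
```

If the last observed date is already the end of the month, the inner loop does not execute and no rows are added for that area.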

Extract month and year for new variable

I have a dataset with a variable that has the date of orders (MMDDYY10.). I need to extract the month and year to look at orders by month--year because there are multiple years. I am vaguely aware of the month and year functions, but how can I use them together to create a new month bin?
Ideally I would create a bin for each month/year so my output can look something like:
Date
Item 2011OCT 2011NOV 2011DEC 2012JAN ...
a 50 40 30 20
b 15 20 25 30
c 1 2 3 4
total
Here is a sample code I created:
data dsstabl;
set dsstabl;
order_month = month(AC_DATE_OF_IMAGE_ORDER);
order_year = year(AC_DATE_OF_IMAGE_ORDER);
order = compress(order_month||order_year);
run;
proc freq data=dsstabl;
table item * order;
run;
Thanks in advance!
When you are doing your analysis, use an appropriate format. MONYY. sounds like the right one. That will sort properly and will group values accordingly.
Something like:
proc means data=yourdata;
class datevar;
format datevar MONYY7.;
var whatever;
run;
So your table:
proc tabulate data=dsstabl;
class item datevar;
format datevar MONYY7.;
tables item,datevar*n;
run;
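Or you can do it with a nice trick: INTNX with the 'B' alignment moves each date to the first day of its month, so all dates in the same month share one value.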
data my_data_with_months;
set my_data;
MONTH = INTNX('month', NORMAL_DATE, 0, 'B');
run;
Always use it.
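The format-based approach also works directly in PROC FREQ, which groups by formatted values. The sketch below assumes the dataset and date variable named in the question (dsstabl and AC_DATE_OF_IMAGE_ORDER) and uses the YYMON7. format, which writes values such as 2011OCT to match the layout asked for.

```sas
/* Cross-tabulate item by order month without creating a new variable */
proc freq data=dsstabl;
  tables item * AC_DATE_OF_IMAGE_ORDER / norow nocol nopercent;
  format AC_DATE_OF_IMAGE_ORDER yymon7.;
run;
```

The columns still sort chronologically because the underlying values remain SAS dates; only the display and grouping use the formatted text.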