Calculating rolling correlations in SAS

Calculating rolling correlations in SAS - sas

I have a data set here.
An excerpt of my data looks like this: (For an enlarged version: http://puu.sh/79NCK.jpg)
(Note: there are no missing values in my dataset)
I wish to calculate the correlation matrix using a rolling window of 1 year. My period starts from 01 Jan 2008. So for example, the correlation between AUT and BEL on 01 Jan 2008 is calculated using the series of values from 01 Jan 2007 to 01 Jan 2008, and likewise for all other pairs. Similarly the correlation between AUT and BEL on 02 Jan 2008 is calculated using the series of values from 02 Jan 2007 to 02 Jan 2008.
Since there will be a different correlation matrix for each day, I want to output each day's correlation matrix into a sheet in excel and name that sheet COV1 (for 01 Jan 2008), COV2 (for 02 Jan 2008), COV3 (for 03 Jan 2008), and so on until COV1566 (for 31 Dec 2013). An excerpt of the output for each sheet is like this: (Note: with the titles included on the top row and first column)
http://puu.sh/79NAy.jpg
I have loaded my datafile into SAS named rolling. For the moment, my code is simply:
proc corr data = mm.rolling;
run;
Which simply calculates the correlation matrix using the entire series of values. I am very new to SAS, any help would be appreciated.

Think about how you might do if you had immense amount of patience.
proc corr data = mm.rolling out = correlation_as_of_01jan2008;
where date between '01jan2007'd and '01jan2008'd;
run;
Similarly,
proc corr data = mm.rolling out = correlation_as_of_02jan2008;
where date between '02jan2007'd and '02jan2008'd;
run;
Thankfully you can use SAS macro programming to achieve a similar effect as shown in this macro:
%macro rollingCorrelations(inputDataset=, refDate=);
/*first get a list of unique dates on or after the reference date*/
proc freq data = &inputDataset. noprint;
where date >="&refDate."d;
table date/out = dates(keep = date);
run;
/*for each date calculate what the window range is, here using a year's length*/
data dateRanges(drop = date);
set dates end = endOfFile
nobs= numDates;
format toDate fromDate date9.;
toDate=date;
fromDate = intnx('year', toDate, -1, 's');
call symputx(compress("toDate"!!_n_), put(toDate,date9.));
call symputx(compress("fromDate"!!_n_), put(fromDate, date9.) );
/*find how many times(numberOfWindows) we need to iterate through*/
if endOfFile then do;
call symputx("numberOfWindows", numDates);
end;
run;
%do i = 1 %to &numberOfWindows.;
/*create a temporary view which has the filtered data that is passed to PROC CORR*/
data windowedDataview / view = windowedDataview;
set &inputDataset.;
where date between "&&fromDate&i."d and "&&toDate&i."d;
drop date;
run;
/*the output dataset from each PROC CORR run will be
correlation_DDMMMYYY<from date>_DDMMMYY<start date>*/
proc corr data = windowedDataview
outp = correlations_&&fromDate&i.._&&toDate&i. (where=(_type_ = 'CORR'))
noprint;
run;
%end;
/*append all datasets into a single table*/
data all_correlations;
format from to date9.;
set correlations_:
indsname = datasetname
;
from = input(substr(datasetname,19,9),date9.);
to = input(substr(datasetname,29,9), date9.);
run;
%mend rollingCorrelations;
%rollingCorrelations(inputDataset=rolling, refDate=01JAN2008)
The final output from the above macro will have from & to identifier to identify which date range each correlation matrix refers. Run it and examine the results.
I dont think excel can accomodate over 1500 tabs anyway, so best to keep it in a single table. The final table had 81K rows and the whole process ran in 2.5 mins.
update: to sort them by from & to
proc sort data = ALL_CORRELATIONS;
by from to;
run;

Related

Apply if and do in SAS to merge a dataset

I'm trying to merge a dataset to another table (hist_dataset) by applying one condition.
The dataset that I'm trying to merge looks like this:
Label
week_start
date
Value1
Value2
Ac
09Jan2023
13Jan2023
45
43
The logic that I'm using is the next:
If the value("week_start" column) of the first record is equal to today's week + 14 then merge the dataset with the dataset that I want to append.
If the value(week_start column) of the first record is not equal to today's week + 14 then do nothing, don't merge the data.
The code that I'm using is the next:
libname out /"path"
data dataset;
set dataset;
by week_start;
if first.week_start = intnx('week.2', today() + 14, 0, 'b') then do;
data dataset;
merge out.hist_dataset dataset;
by label, week_start, date;
end;
run;
But I'm getting 2 Errors:
117 - 185: There was 1 unclosed DO block.
161 - 185: No matching DO/SELECT statement.
Do you know how can make the program run correctly or do you know another way to do it?
Thanks,
'''

I cannot make heads or tails of what you are asking. So let me take a guess at what you are trying to do and give answer to my guesses.
Let's first make up some dataset and variable names. So you have an existing dateset named OLD that has key variables LABEL WEEK_START and DATE.
Now you have received a NEW dataset that has those same variables.
You want to first subset the NEW dataset to just those observations where the value of DATE is within 14 days of the first value of START_WEEK in the NEW dataset.
data subset ;
set new;
if _n_=1 then first_week=start_week;
retain first_week;
if date <= first_week+14 ;
run;
You then want to merge that into the OLD dataset.
data want;
merge old subset;
by label week_start date ;
run;

Proc Tabulate: Reordering a Formatted Variable

I created a day variable using the following code:
DAY=datepart(checkin_date_time); /*example of checkin_date_time 1/1/2014 4:44:00*/
format DAY DOWNAME.;
Sample Data:
ID checkin_date_time Admit_Type BED_ORDERED_TO_DISPO
1 1/1/2014 4:40:00 ICU 456
2 1/1/2014 5:64:00 Psych 146
3 1/1/2014 14:48:00 Acute 57
4 1/1/2014 20:34:00 ICU 952
5 1/2/2014 10:00:00 Psych 234
6 1/2/2014 3:48:00 Psych 846
7 1/2/2014 10:14:00 ICU 90
8 1/2/2014 22:27:00 ICU 148
I want to analyze some data using Proc Tab where day is one of the class variables and have the day of week appear in chronological order in the output; however, the output table begins with Tuesday. I would like it to start with Sunday. I've read over the the following page http://support.sas.com/resources/papers/proceedings11/085-2011.pdf and tried the proc format invalue code but it's producing a table that where the "day of week" = "21". Not quite sure where to go from here.
Thanks!
proc format;
invalue day_name
'Sunday'=1
'Monday'=2
'Tuesday'=3
'Wednesday'=4
'Thursday'=5
'Friday'=6
'Saturday'=7;
value day_names
1='Sunday'
2='Monday'
3='Tuesday'
4='Wednesday'
5='Thursday'
6='Friday'
7='Saturday';
run;
data Combined_day;
set Combined;
day_of_week = input(day,day_name.);
run;
proc tabulate data = Combined_day;
class Day Admit_Type;
var BED_ORDERED_TO_DISPO ;
format day_of_week day_names.;
table Day*Admit_Type, BED_ORDERED_TO_DISPO * (N Median);
run;

Fundamentally, you are confusing actual values with displayed values (i.e., formats). Specifically, datepart extracts the date portion out of a date/time field. Then, applying a format only changes how it is displayed not actual underlying value. So below DAY never contains the character values of 'WEDNESDAY' or 'THURSDAY' but original integer value (19724 and 19725).
DAY = datepart(checkin_date_time); // DATE VALUE
format DAY DOWNAME.; // FORMATTED DATE VALUE (SAME UNDERLYING DATE VALUE)
Consider actually assigning a column as weekday value using WEEKDAY function. Then apply your user-defined format for proc tabulate.
data Combined_day;
set Combined;
checkin_date = datepart(checkin_date_time); // NEW DATE VALUE (NO TIME)
format checkin_date date9.;
checkin_weekday = weekday(checkin_date); // NEW INTEGER VALUE OF WEEKDAY
run;
proc tabulate data = Combined_day;
class checkin_weekday Admit_Type;
var BED_ORDERED_TO_DISPO ;
format checkin_weekday day_names.; // APPLY USER DEFINED FORMAT
table checkin_weekday*Admit_Type, BED_ORDERED_TO_DISPO * (N Median);
run;

SAS Line graphs from Data in a row rather than column

I've searched but none of the information shows how to plot a line graph from data that is given in a row, rather than column.
I have data in this form:
Firstname Lastname Sep Oct Nov Dec Jan Feb March April May June July
There are 100 rows of data with individual people. I have to plot each graph for each individual starting from Sep To July. My output will be 100 individual graphs. I know how to plot if the data is in column, but that is not what i am given. Changing the data is going to be too much work. I do not have any sas codes for rows:
**Proc sgplot data=data1;
series x=??? ( i need mths from Sep to July here)
Series y= ?? (will be the marks from the Sep to July)
Run;**
Here is how the output should look:

Your table needs to be in a flat format, e.g.:
FirstName LastName Date
John Smith 01JAN2018
Jane Doe 01JAN2018
This can be done with PROC TRANSPOSE. It is best to align your dates to a specific year/date. This will maintain the correct date order. Assume that your data is for 2018.
Create sample data
data have;
length name $10.;
array months[*] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec;
retain goal 75;
do name = 'Mark', 'Jane', 'Jake', 'John', 'Jack', 'Jill', 'Bill', 'Jerry', 'Joseph';
do i = 1 to dim(months);
months[i] = round(100*rand('uniform') );
end;
output;
end;
drop i;
run;
Solution
proc sort data=have;
by name goal;
run;
proc transpose data=have
out=have_transposed;
by name goal;
var Jan--Dec;
run;
data want;
set have_transposed;
Month = input(cats('01', _NAME_, 2018), DATE9.);
rename COL1 = Score;
format month monname3.;
drop _NAME_;
run;
proc sgplot data=want;
by name;
series x=month y=goal / name='goal' lineattrs=(color=salmon thickness=2);
series x=month y=score / name='series' lineattrs=(thickness=2);
scatter x=month y=score / markerattrs=(symbol=circlefilled) name='points';
keylegend 'series' 'goal';
run;

Filling in missing values of a rolling correlation matrix

This question partially relates to this question.
My datafile can be found here. I use a sample period from 01 Jan 2008 to 31 Dec 2013. The datafile has no missing values.
The following code generates the rolling correlation matrix on each day from 01 Jan 2008 to 31 Dec 2013 using a rolling window of the previous 1 year worth of values. E.g., the correlation between AUT and BEL on 01 Jan 2008 is calculated using the series of values from 01 Jan 2007 to 01 Jan 2008, and likewise for all other pairs.
data work.rolling;
set mm.rolling;
run;
%macro rollingCorrelations(inputDataset=, refDate=);
/*first get a list of unique dates on or after the reference date*/
proc freq data = &inputDataset. noprint;
where date >="&refDate."d;
table date/out = dates(keep = date);
run;
/*for each date calculate what the window range is, here using a year's length*/
data dateRanges(drop = date);
set dates end = endOfFile
nobs= numDates;
format toDate fromDate date9.;
toDate=date;
fromDate = intnx('year', toDate, -1, 's');
call symputx(compress("toDate"!!_n_), put(toDate,date9.));
call symputx(compress("fromDate"!!_n_), put(fromDate, date9.) );
/*find how many times(numberOfWindows) we need to iterate through*/
if endOfFile then do;
call symputx("numberOfWindows", numDates);
end;
run;
%do i = 1 %to &numberOfWindows.;
/*create a temporary view which has the filtered data that is passed to PROC CORR*/
data windowedDataview / view = windowedDataview;
set &inputDataset.;
where date between "&&fromDate&i."d and "&&toDate&i."d;
drop date;
run;
/*the output dataset from each PROC CORR run will be
correlation_DDMMMYYY<from date>_DDMMMYY<start date>*/
proc corr data = windowedDataview
outp = correlations_&&fromDate&i.._&&toDate&i. (where=(_type_ = 'CORR'))
noprint;
run;
%end;
/*append all datasets into a single table*/
data all_correlations;
format from to date9.;
set correlations_:
indsname = datasetname
;
from = input(substr(datasetname,19,9),date9.);
to = input(substr(datasetname,29,9), date9.);
run;
%mend rollingCorrelations;
%rollingCorrelations(inputDataset=rolling, refDate=01JAN2008)
An excerpt of the output can be found here.
As can be seen row 2 to row 53 presents the correlation matrix for the day 1 Apr 2008. However, a problem arises for the correlation matrix for the day 1 Apr 2009: there are missing values for correlation coefficients for ALPHA and its pairs. This is because if one looks at the datafile, the values for ALPHA from 1 Apr 2008 to 1 Apr 2009 are all zero, hence causing a division by zero. This situation happens with a few other data values too, for example, HSBC also has all values as 0 from 1 Apr 08 to 1 Apr 09.
To resolve this issue, I was wondering how the above code can be modified so that in cases where this situation happens (i.e., all values are 0 between 2 certain dates), then the correlation between the two pairs of data values are simply calculated using the WHOLE sample period. E.g., the correlation between ALPHA and AUT is missing on 1 Apr 09, thus this correlation should be calculated using the values from 1 JAN 2008 to 31 DEC 2013, rather than using the values from 1 Apr 08 to 1 Apr 09

Once you run the above macro and have got your all_correlations dataset, you would need to run another PROC CORR this time using all of the data i.e.,
/*first filter the data to be between "01JAN2008"d and "31DEC2013"d*/
data work.all_data_01JAN2008_31DEC2013;
set mm.rolling;
where date between "01JAN2008"d and "31DEC2013"d;
drop date ;
run;
Then pass the above dataset to PROC CORR:
proc corr data = work.all_data_01JAN2008_31DEC2013
outp = correlations_01JAN2008_31DEC2013
(where=(_type_ = 'CORR'))
noprint;
run;
data correlations_01JAN2008_31DEC2013;
length id 8;
set correlations_01JAN2008_31DEC2013;
/*add a column identifier to make sure the order of the correlation matrix is preserved when joined with other tables*/
id = _n_;
run;
You would get a dataset which is unique by the _name_ column.
Then you would have to join correlations_01JAN2008_31DEC2013 to all_correlations in such a way that if a value is missing in all_correlations then a corresponding value from correlations_01JAN2008_31DEC2013 is inserted in its place. For this we can use PROC SQL & the COALESCE function.
PROC SQL;
CREATE TABLE MISSING_VALUES_IMPUTED AS
SELECT
A.FROM
,A.TO
,b.id
,a._name_
,coalesce(a.AUT,b.AUT) as AUT
,coalesce(a.BEL,b.BEL) as BEL
,coalesce(a.DEN,b.DEN) as DEN
,coalesce(a.FRA,b.FRA) as FRA
,coalesce(a.GER,b.GER) as GER
,coalesce(a.GRE,b.GRE) as GRE
,coalesce(a.IRE,b.IRE) as IRE
,coalesce(a.ITA,b.ITA) as ITA
,coalesce(a.NOR,b.NOR) as NOR
,coalesce(a.POR,b.POR) as POR
,coalesce(a.SPA,b.SPA) as SPA
,coalesce(a.SWE,b.SWE) as SWE
,coalesce(a.NL,b.NL) as NL
,coalesce(a.ERS,b.ERS) as ERS
,coalesce(a.RZB,b.RZB) as RZB
,coalesce(a.DEX,b.DEX) as DEX
,coalesce(a.KBD,b.KBD) as KBD
,coalesce(a.DAB,b.DAB) as DAB
,coalesce(a.BNP,b.BNP) as BNP
,coalesce(a.CRDA,b.CRDA) as CRDA
,coalesce(a.KN,b.KN) as KN
,coalesce(a.SGE,b.SGE) as SGE
,coalesce(a.CBK,b.CBK) as CBK
,coalesce(a.DBK,b.DBK) as DBK
,coalesce(a.IKB,b.IKB) as IKB
,coalesce(a.ALPHA,b.ALPHA) as ALPHA
,coalesce(a.ALBK,b.ALBK) as ALBK
,coalesce(a.IPM,b.IPM) as IPM
,coalesce(a.BKIR,b.BKIR) as BKIR
,coalesce(a.BMPS,b.BMPS) as BMPS
,coalesce(a.PMI,b.PMI) as PMI
,coalesce(a.PLO,b.PLO) as PLO
,coalesce(a.BINS,b.BINS) as BINS
,coalesce(a.MB,b.MB) as MB
,coalesce(a.UC,b.UC) as UC
,coalesce(a.BCP,b.BCP) as BCP
,coalesce(a.BES,b.BES) as BES
,coalesce(a.BBV,b.BBV) as BBV
,coalesce(a.SCHSPS,b.SCHSPS) as SCHSPS
,coalesce(a.NDA,b.NDA) as NDA
,coalesce(a.SEA,b.SEA) as SEA
,coalesce(a.SVK,b.SVK) as SVK
,coalesce(a.SPAR,b.SPAR) as SPAR
,coalesce(a.CSGN,b.CSGN) as CSGN
,coalesce(a.UBSN,b.UBSN) as UBSN
,coalesce(a.ING,b.ING) as ING
,coalesce(a.SNS,b.SNS) as SNS
,coalesce(a.BARC,b.BARC) as BARC
,coalesce(a.HBOS,b.HBOS) as HBOS
,coalesce(a.HSBC,b.HSBC) as HSBC
,coalesce(a.LLOY,b.LLOY) as LLOY
,coalesce(a.STANBS,b.STANBS) as STANBS
from all_correlations as a
inner join correlations_01JAN2008_31DEC2013 as b
on a._name_ = b._name_
order by
A.FROM
,A.TO
,b.id
;
quit;
/*verify that no missing values are left. NMISS column should be 0 from all variables*/
proc means data = MISSING_VALUES_IMPUTED n nmiss;
run;

Extract month and year for new variable

I have a dataset with a variable that has the date of orders (MMDDYY10.). I need to extract the month and year to look at orders by month--year because there are multiple years. I am vaguely aware of the month and year functions, but how can I use them together to create a new month bin?
Ideally I would create a bin for each month/year so my output can look something like:
Date
Item 2011OCT 2011NOV 2011DEC 2012JAN ...
a 50 40 30 20
b 15 20 25 30
c 1 2 3 4
total
Here is a sample code I created:
data dsstabl;
set dsstabl;
order_month = month(AC_DATE_OF_IMAGE_ORDER);
order_year = year(AC_DATE_OF_IMAGE_ORDER);
order = compress(order_month||order_year);
run;
proc freq data
table item * _order;
run;
Thanks in advance!

When you are doing your analysis, use an appropriate format. MONYY. sounds like the right one. That will sort properly and will group values accordingly.
Something like:
proc means data=yourdata;
class datevar;
format datevar MONYY7.;
var whatever;
run;
So your table:
proc tabulate data=dsstabl;
class item datevar;
format datevar MONYY7.;
tables item,datevar*n;
run;

Or you can do it with nice trick.
data my_data_with_months;
set my_data;
MONTH = INTNX('month', NORMAL_DATE, 0, 'B');
run;
Always use it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Calculating rolling correlations in SAS - sas

Related

Apply if and do in SAS to merge a dataset

Proc Tabulate: Reordering a Formatted Variable

SAS Line graphs from Data in a row rather than column

Filling in missing values of a rolling correlation matrix

Extract month and year for new variable

Categories

Resources