I have a character column called month with values like this:
month
JAN
FEB
...
DEC
I'd like to know how to convert them into 1, 2, 3, ..., 12 in SAS. Thanks a lot.
Use an informat to convert the text to a SAS date, then use MONTH() to get the month number.
data have;
input month :$3. @@;
datalines;
JAN FEB DEC
;
data want;
set have;
x=month(input(month||'21',??monyy.));
run;
Concatenate the month with a year, read the result with the MONYY. informat, write it with the MONTH. format, and finally convert it to a numeric value with another INPUT().
data have;
input month :$3. @@;
datalines;
JAN FEB DEC
;
data want;
set have;
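/* MONYY7. reads all seven characters, e.g. JAN2024; the default width of 5 would stop after two year digits */
month_num=input(put(input(catt(month, year(today())), monyy7.), month.), 2.);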
put month month_num;
run;
Results:
JAN 1
FEB 2
DEC 12
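The nested call above can be hard to read, so here is a minimal sketch that splits it into one step per line. The fixed year '2021' and the variable names as_text, as_date and as_num are only illustrative assumptions; any valid year would do, since only the month is kept.

```sas
data _null_;
  month = 'FEB';
  /* Step 1: build a MONYY-style string; the year itself is arbitrary */
  as_text = catt(month, '2021');
  /* Step 2: read the string as a SAS date (first day of that month) */
  as_date = input(as_text, monyy7.);
  /* Step 3: extract the month number from the date */
  as_num = month(as_date);
  put as_text= as_date= date9. as_num=;
run;
```

Calling MONTH() directly on the date, as in the first answer, avoids the extra PUT/INPUT round trip through the MONTH. format.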
I have data in the format 01 Jan 19.00.00 (datetime), and I want to extract only the month name from it.
I tried the code below, but the output is numeric, i.e. 1, 2, 3 and so on. I want the output either in Jan, Feb, Mar format or in January, February, March format.
data want;
set detail;
month = month(datepart(BEGIN_DATE_TIME));
run;
You can use the MONNAME format.
data test;
dt = datetime();
monname = put(datepart(dt),MONNAME.);
put monname=;
run;
If you want "Oct" rather than "October", you can add a 3 to the format (MONNAME3.).
If you are using the value in a report, the better approach might be to keep a date value and format it with MONNAME.
The values of a date-formatted variable are ordered properly when the variable is used in a CLASS or BY statement. If you had instead computed a new character variable holding the month name, the default ordering of values would be alphabetical.
data want;
set have;
begin_date = datepart(BEGIN_DATE_TIME);
format begin_date MONNAME3.;
run;
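To illustrate the ordering point, here is a small sketch that puts both approaches side by side. The dataset name demo, the three datetime values and the variable month_name are made up for the example.

```sas
data demo;
  /* three hypothetical datetimes, deliberately not in calendar order */
  do dt = '15JAN2019:00:00:00'dt, '10MAR2019:00:00:00'dt, '02FEB2019:00:00:00'dt;
    begin_date = datepart(dt);                /* numeric date, shown as a month name */
    month_name = put(begin_date, monname3.);  /* character copy of the month name */
    output;
  end;
  format begin_date monname3. dt datetime19.;
run;

/* begin_date should be listed in calendar order; month_name alphabetically */
proc freq data=demo;
  tables begin_date month_name;
run;
```

Both tables show the same labels; the difference is that the formatted date still carries the underlying date value, which is what the procedure sorts on by default.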
I have a data set here.
An excerpt of my data looks like this: (For an enlarged version: http://puu.sh/79NCK.jpg)
(Note: there are no missing values in my dataset)
I wish to calculate the correlation matrix using a rolling window of 1 year. My period starts from 01 Jan 2008. So for example, the correlation between AUT and BEL on 01 Jan 2008 is calculated using the series of values from 01 Jan 2007 to 01 Jan 2008, and likewise for all other pairs. Similarly the correlation between AUT and BEL on 02 Jan 2008 is calculated using the series of values from 02 Jan 2007 to 02 Jan 2008.
Since there will be a different correlation matrix for each day, I want to output each day's correlation matrix to its own sheet in Excel and name the sheets COV1 (for 01 Jan 2008), COV2 (for 02 Jan 2008), COV3 (for 03 Jan 2008), and so on until COV1566 (for 31 Dec 2013). An excerpt of the output for each sheet looks like this (note: with the titles included in the top row and first column):
http://puu.sh/79NAy.jpg
I have loaded my datafile into SAS named rolling. For the moment, my code is simply:
proc corr data = mm.rolling;
run;
This simply calculates the correlation matrix using the entire series of values. I am very new to SAS, so any help would be appreciated.
Think about how you might do it by hand if you had an immense amount of patience.
proc corr data = mm.rolling outp = correlation_as_of_01jan2008;
where date between '01jan2007'd and '01jan2008'd;
run;
Similarly,
proc corr data = mm.rolling outp = correlation_as_of_02jan2008;
where date between '02jan2007'd and '02jan2008'd;
run;
Thankfully, you can use SAS macro programming to generate these steps for every date, as shown in this macro:
%macro rollingCorrelations(inputDataset=, refDate=);
/*first get a list of unique dates on or after the reference date*/
proc freq data = &inputDataset. noprint;
where date >="&refDate."d;
table date/out = dates(keep = date);
run;
/*for each date calculate what the window range is, here using a year's length*/
data dateRanges(drop = date);
set dates end = endOfFile
nobs= numDates;
format toDate fromDate date9.;
toDate=date;
fromDate = intnx('year', toDate, -1, 's');
call symputx(compress("toDate"!!_n_), put(toDate,date9.));
call symputx(compress("fromDate"!!_n_), put(fromDate, date9.) );
/*find how many times(numberOfWindows) we need to iterate through*/
if endOfFile then do;
call symputx("numberOfWindows", numDates);
end;
run;
%do i = 1 %to &numberOfWindows.;
/*create a temporary view which has the filtered data that is passed to PROC CORR*/
data windowedDataview / view = windowedDataview;
set &inputDataset.;
where date between "&&fromDate&i."d and "&&toDate&i."d;
drop date;
run;
/*the output dataset from each PROC CORR run will be named
correlations_<from date as DDMMMYYYY>_<to date as DDMMMYYYY>*/
proc corr data = windowedDataview
outp = correlations_&&fromDate&i.._&&toDate&i. (where=(_type_ = 'CORR'))
noprint;
run;
%end;
/*append all datasets into a single table*/
data all_correlations;
format from to date9.;
set correlations_:
indsname = datasetname
;
from = input(substr(datasetname,19,9),date9.);
to = input(substr(datasetname,29,9), date9.);
run;
%mend rollingCorrelations;
%rollingCorrelations(inputDataset=rolling, refDate=01JAN2008)
The final output of the above macro has from and to columns that identify which date range each correlation matrix refers to. Run it and examine the results.
I don't think Excel can accommodate over 1,500 tabs anyway, so it is best to keep everything in a single table. The final table had 81K rows and the whole process ran in 2.5 minutes.
Update: to sort the results by from and to:
proc sort data = ALL_CORRELATIONS;
by from to;
run;
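For readers unfamiliar with the &&fromDate&i. references inside the macro, the sketch below shows how they resolve. The macro variable values are set by hand here purely for illustration; in the macro they are created by CALL SYMPUTX.

```sas
/* hand-set values standing in for what CALL SYMPUTX creates in the macro */
%let fromDate1 = 01JAN2007;
%let toDate1   = 01JAN2008;
%let i = 1;

/* pass 1: && becomes & and &i. becomes 1, leaving &fromDate1._&toDate1.
   pass 2: &fromDate1. and &toDate1. resolve to the dates */
%put correlations_&&fromDate&i.._&&toDate&i.;
```

The log should show correlations_01JAN2007_01JAN2008, which is exactly 32 characters, the maximum length of a SAS dataset name. That fixed layout is also why the SUBSTR positions 19 and 29 work on the INDSNAME= value, which starts with the five characters "WORK.".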
This question partially relates to this question.
My datafile can be found here. I use a sample period from 01 Jan 2008 to 31 Dec 2013. The datafile has no missing values.
The following code generates the rolling correlation matrix on each day from 01 Jan 2008 to 31 Dec 2013 using a rolling window of the previous 1 year worth of values. E.g., the correlation between AUT and BEL on 01 Jan 2008 is calculated using the series of values from 01 Jan 2007 to 01 Jan 2008, and likewise for all other pairs.
data work.rolling;
set mm.rolling;
run;
%macro rollingCorrelations(inputDataset=, refDate=);
/*first get a list of unique dates on or after the reference date*/
proc freq data = &inputDataset. noprint;
where date >="&refDate."d;
table date/out = dates(keep = date);
run;
/*for each date calculate what the window range is, here using a year's length*/
data dateRanges(drop = date);
set dates end = endOfFile
nobs= numDates;
format toDate fromDate date9.;
toDate=date;
fromDate = intnx('year', toDate, -1, 's');
call symputx(compress("toDate"!!_n_), put(toDate,date9.));
call symputx(compress("fromDate"!!_n_), put(fromDate, date9.) );
/*find how many times(numberOfWindows) we need to iterate through*/
if endOfFile then do;
call symputx("numberOfWindows", numDates);
end;
run;
%do i = 1 %to &numberOfWindows.;
/*create a temporary view which has the filtered data that is passed to PROC CORR*/
data windowedDataview / view = windowedDataview;
set &inputDataset.;
where date between "&&fromDate&i."d and "&&toDate&i."d;
drop date;
run;
/*the output dataset from each PROC CORR run will be named
correlations_<from date as DDMMMYYYY>_<to date as DDMMMYYYY>*/
proc corr data = windowedDataview
outp = correlations_&&fromDate&i.._&&toDate&i. (where=(_type_ = 'CORR'))
noprint;
run;
%end;
/*append all datasets into a single table*/
data all_correlations;
format from to date9.;
set correlations_:
indsname = datasetname
;
from = input(substr(datasetname,19,9),date9.);
to = input(substr(datasetname,29,9), date9.);
run;
%mend rollingCorrelations;
%rollingCorrelations(inputDataset=rolling, refDate=01JAN2008)
An excerpt of the output can be found here.
As can be seen, rows 2 to 53 present the correlation matrix for 1 Apr 2008. However, a problem arises in the correlation matrix for 1 Apr 2009: the correlation coefficients for ALPHA and its pairs are missing. Looking at the datafile, the values for ALPHA from 1 Apr 2008 to 1 Apr 2009 are all zero, so the series has zero variance and computing the correlation involves a division by zero. The same situation occurs for a few other series; for example, HSBC is also 0 throughout 1 Apr 2008 to 1 Apr 2009.
To resolve this issue, I was wondering how the above code can be modified so that, whenever this situation occurs (i.e., all values are 0 between two given dates), the correlation for the affected pairs is instead calculated using the WHOLE sample period. E.g., the correlation between ALPHA and AUT is missing on 1 Apr 2009, so this correlation should be calculated using the values from 1 Jan 2008 to 31 Dec 2013 rather than the values from 1 Apr 2008 to 1 Apr 2009.
Once you have run the above macro and have your all_correlations dataset, you need to run another PROC CORR, this time using all of the data, i.e.,
/*first filter the data to be between "01JAN2008"d and "31DEC2013"d*/
data work.all_data_01JAN2008_31DEC2013;
set mm.rolling;
where date between "01JAN2008"d and "31DEC2013"d;
drop date ;
run;
Then pass the above dataset to PROC CORR:
proc corr data = work.all_data_01JAN2008_31DEC2013
outp = correlations_01JAN2008_31DEC2013
(where=(_type_ = 'CORR'))
noprint;
run;
data correlations_01JAN2008_31DEC2013;
length id 8;
set correlations_01JAN2008_31DEC2013;
/*add a column identifier to make sure the order of the correlation matrix is preserved when joined with other tables*/
id = _n_;
run;
This gives you a dataset with one row per value of the _name_ column.
Then join correlations_01JAN2008_31DEC2013 to all_correlations in such a way that, if a value is missing in all_correlations, the corresponding value from correlations_01JAN2008_31DEC2013 is inserted in its place. For this we can use PROC SQL and the COALESCE function.
PROC SQL;
CREATE TABLE MISSING_VALUES_IMPUTED AS
SELECT
A.FROM
,A.TO
,b.id
,a._name_
,coalesce(a.AUT,b.AUT) as AUT
,coalesce(a.BEL,b.BEL) as BEL
,coalesce(a.DEN,b.DEN) as DEN
,coalesce(a.FRA,b.FRA) as FRA
,coalesce(a.GER,b.GER) as GER
,coalesce(a.GRE,b.GRE) as GRE
,coalesce(a.IRE,b.IRE) as IRE
,coalesce(a.ITA,b.ITA) as ITA
,coalesce(a.NOR,b.NOR) as NOR
,coalesce(a.POR,b.POR) as POR
,coalesce(a.SPA,b.SPA) as SPA
,coalesce(a.SWE,b.SWE) as SWE
,coalesce(a.NL,b.NL) as NL
,coalesce(a.ERS,b.ERS) as ERS
,coalesce(a.RZB,b.RZB) as RZB
,coalesce(a.DEX,b.DEX) as DEX
,coalesce(a.KBD,b.KBD) as KBD
,coalesce(a.DAB,b.DAB) as DAB
,coalesce(a.BNP,b.BNP) as BNP
,coalesce(a.CRDA,b.CRDA) as CRDA
,coalesce(a.KN,b.KN) as KN
,coalesce(a.SGE,b.SGE) as SGE
,coalesce(a.CBK,b.CBK) as CBK
,coalesce(a.DBK,b.DBK) as DBK
,coalesce(a.IKB,b.IKB) as IKB
,coalesce(a.ALPHA,b.ALPHA) as ALPHA
,coalesce(a.ALBK,b.ALBK) as ALBK
,coalesce(a.IPM,b.IPM) as IPM
,coalesce(a.BKIR,b.BKIR) as BKIR
,coalesce(a.BMPS,b.BMPS) as BMPS
,coalesce(a.PMI,b.PMI) as PMI
,coalesce(a.PLO,b.PLO) as PLO
,coalesce(a.BINS,b.BINS) as BINS
,coalesce(a.MB,b.MB) as MB
,coalesce(a.UC,b.UC) as UC
,coalesce(a.BCP,b.BCP) as BCP
,coalesce(a.BES,b.BES) as BES
,coalesce(a.BBV,b.BBV) as BBV
,coalesce(a.SCHSPS,b.SCHSPS) as SCHSPS
,coalesce(a.NDA,b.NDA) as NDA
,coalesce(a.SEA,b.SEA) as SEA
,coalesce(a.SVK,b.SVK) as SVK
,coalesce(a.SPAR,b.SPAR) as SPAR
,coalesce(a.CSGN,b.CSGN) as CSGN
,coalesce(a.UBSN,b.UBSN) as UBSN
,coalesce(a.ING,b.ING) as ING
,coalesce(a.SNS,b.SNS) as SNS
,coalesce(a.BARC,b.BARC) as BARC
,coalesce(a.HBOS,b.HBOS) as HBOS
,coalesce(a.HSBC,b.HSBC) as HSBC
,coalesce(a.LLOY,b.LLOY) as LLOY
,coalesce(a.STANBS,b.STANBS) as STANBS
from all_correlations as a
inner join correlations_01JAN2008_31DEC2013 as b
on a._name_ = b._name_
order by
A.FROM
,A.TO
,b.id
;
quit;
/*verify that no missing values are left: the NMISS column should be 0 for all variables*/
proc means data = MISSING_VALUES_IMPUTED n nmiss;
run;
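Typing one COALESCE per column gets tedious with this many series. As a possible shortcut, the expression list can be generated from the column metadata in DICTIONARY.COLUMNS. This is only a sketch: the macro variable name coalesceList is made up, and it assumes the whole-period table lives in WORK and that every numeric column other than id is a correlation column.

```sas
proc sql noprint;
  /* build "coalesce(a.X, b.X) as X" for every numeric column except ID */
  select 'coalesce(a.' || strip(name) || ', b.' || strip(name) || ') as ' || strip(name)
    into :coalesceList separated by ', '
    from dictionary.columns
    where libname = 'WORK'
      and memname = 'CORRELATIONS_01JAN2008_31DEC2013'
      and type = 'num'
      and upcase(name) ne 'ID';

  /* same join as above, with the generated list in place of the typed one */
  create table missing_values_imputed as
  select a.from, a.to, b.id, a._name_, &coalesceList.
    from all_correlations as a
    inner join correlations_01JAN2008_31DEC2013 as b
      on a._name_ = b._name_
    order by a.from, a.to, b.id;
quit;
```

The LIBNAME and MEMNAME values in DICTIONARY.COLUMNS are stored in upper case, hence the upper-case literals in the WHERE clause.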
I have a SAS dataset that looks like this:
id | Date | ...
1 17 Jun
1 19 Jun
2 17 Jun
2 19 Jun
2 21 Jun
3 12 May
Each id represents a unique person.
I want to keep only one row for each unique person, but still keep all of that person's dates in the dataset.
To achieve this, I need to transform the table into a format such as:
id | Date1 | Date2 | Date3
1 17 Jun 19 Jun
2 17 Jun 19 Jun 21 Jun
3 12 May
If only one date has been assigned to a person, then Date2 and Date3 should be left as missing values.
The full dataset I'm using contains thousands of observations with over 180 different days. However, a unique person will be assigned to at most 5 different days.
Any help appreciated.
PROC SUMMARY has functionality to do this, using the IDGROUP option of the OUTPUT statement. The code below will transpose the data and create 5 date columns (specified by out[5]), in date order (specified by min(date)). If you want more information on how this works, check the IDGROUP option in the PROC MEANS / SUMMARY documentation.
data have;
input id Date :date9.;
format date date9.;
datalines;
1 17Jun2012
1 19Jun2012
2 17Jun2012
2 19Jun2012
2 21Jun2012
3 12May2012
;
run;
proc summary data=have nway;
class id;
output out=want (drop=_:)
idgroup(min(date) out[5] (date)=);
run;
Here are two more ways: first using PROC TRANSPOSE, then using a DATA step (borrowing Keith's data).
Both ways need the data sorted by ID.
data have;
input id Date :date9.;
format date date9.;
datalines;
1 17Jun2012
1 19Jun2012
2 17Jun2012
2 19Jun2012
2 21Jun2012
3 12May2012
4 01JAN2013
4 02JAN2013
4 03JAN2013
4 04JAN2013
4 05JAN2013
;
run;
proc sort data=have;
by id;
run;
Proc transpose data=have out=transpose(drop=_name_) prefix=DATE;
by id;
run;
data ds(drop=cnt date);
retain date1 date2 date3 date4 date5;
format date1 date2 date3 date4 date5 mmddyy10.;
set have;
by id;
if first.id then cnt=1;
select(cnt);
when(1) date1=date;
when(2) date2=date;
when(3) date3=date;
when(4) date4=date;
when(5) date5=date;
otherwise;
end;
cnt+1;
if last.id then do;
output;
call missing(of date1-date5);
end;
run;
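The SELECT block in the DATA step above can also be written with an array, which makes it easier to change the maximum number of dates. This is a sketch of the same idea rather than a replacement; the output name want_array is an assumption, and it expects the sorted have dataset from above.

```sas
data want_array(drop=cnt date);
  set have;
  by id;
  array dates[5] date1-date5;       /* one slot per possible date */
  retain date1-date5;               /* keep values across rows of the same id */
  format date1-date5 mmddyy10.;
  if first.id then do;
    cnt = 0;
    call missing(of dates[*]);      /* start each person with empty slots */
  end;
  cnt + 1;                          /* position of this row within the id */
  if cnt <= dim(dates) then dates[cnt] = date;
  if last.id then output;           /* one row per person */
run;
```

Rows beyond the fifth date for an id are silently ignored, matching the OTHERWISE branch of the SELECT version.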