I have the following matrix of data, which I am reading into SAS:
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
It represents 5 different species of animal (the rows) in 5 different areas of an environment (the columns). I want to get a Shannon diversity index for the whole environment, so I sum down the rows (i.e., take the column totals) to get:
48 54 68 79 84
Then calculate the Shannon index from this, to get:
1.5873488
What I need to do, however, is calculate a confidence interval for this Shannon index. So I want to perform a nonparametric bootstrap on the initial matrix.
Can anyone advise how this is possible in SAS?
There are several ways to do this in SAS. I would use proc surveyselect to generate the bootstrap samples, and then calculate the Shannon Index for each replicate. (I didn't know what the Shannon Index was, so my code is just based on what I read on Wikipedia.)
data animals;
input v1-v5;
cards;
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
run;
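/* Generate 5000 bootstrap samples, with replacement. */
/* OUTHITS writes one row per selection; without it, a row drawn several times */
/* appears once with a NumberHits count, and the sums below would be wrong. */
proc surveyselect data=animals method=urs n=5 reps=5000 seed=10024 outhits out=boots;
run;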
/* For each replicate, calculate the sum of each variable */
proc means data=boots noprint nway;
class replicate;
var v:;
output out=sums sum=;
run;
/* Calculate the proportions, and p*log(p), which will be used next */
data sums;
set sums;
ttl=sum(of v1-v5);
array ps{*} p1-p5;
array vs{*} v1-v5;
array hs{*} h1-h5;
do i=1 to dim(vs);
ps{i}=vs{i}/ttl;
hs{i}=ps{i}*log(ps{i});
end;
keep replicate h:;
run;
/* Calculate the Shannon Index, again for each replicate */
data shannon;
set sums;
shannon = -sum(of h:);
keep replicate shannon;
run;
We now have a data set, shannon, which contains the Shannon Index calculated for each of 5000 bootstrap samples. You could use this to calculate p-values, but if you just want critical values, you can run proc means. (Use proc univariate instead if you want a 95% interval, as I don't think it's possible to get the 2.5 and 97.5 percentiles from proc means.)
proc means data=shannon mean p1 p5 p95 p99;
var shannon;
run;
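For a 95% percentile interval, a minimal sketch with proc univariate could look like the following. It assumes the shannon data set built above; the output data set name shannon_ci and the prefix p_ are placeholders chosen for illustration.

```sas
/* Percentile bootstrap interval: 2.5th and 97.5th percentiles of the replicates */
proc univariate data=shannon noprint;
  var shannon;
  /* PCTLPTS= requests arbitrary percentiles; PCTLPRE= sets the variable-name prefix */
  output out=shannon_ci pctlpts=2.5 97.5 pctlpre=p_;
run;

/* The output variables are named p_2_5 and p_97_5 */
proc print data=shannon_ci;
run;
```

This is the simple percentile method; other bootstrap intervals (for example bias-corrected ones) would need extra steps.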
I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;
I now want to estimate the following model for each ID:
proc reg data=have;
by ID;
model EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size, say 2. A window should only contain non-missing observations. For example, for firm A (ID 1) the first usable year is 1961, so with a window of size 2 the model can be estimated only once (on 1961 and 1962).
Finally, I want a table with firms as rows and years as columns, holding the coefficient estimated from the window that ends in that year. For firm A, the available years allow only one estimation: in 1962 the coefficient has a value X based on the 2-year window ending then. Applying the same logic to the other two firms gives the following table, where "X" stands for the estimated coefficient for that firm and year and "n" means no such value exists:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;
I do not know how to do this. Furthermore, I would like to create a macro that lets me estimate different rolling-window models while still producing analogous output datasets. I would appreciate any help, since I have been struggling with this for quite some time.
Try this macro. This will only output if there are non-missing values of lags that you specify.
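%macro lag(data=, out=, window=);
/* Assumes &data. is sorted by ID and YEAR */
data _want_;
set &data.;
by ID;
LAG_EARN = lag&window.(earn);
/* Blank out lags that reach back into the previous ID (also correct for window > 1) */
if lag&window.(ID) ne ID then call missing(lag_earn);
if(NOT missing(lag_earn));
run;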
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);
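Note that the macro reshapes lagged values; it does not fit the regressions themselves. A rough sketch of one way to get rolling-window coefficients is shown below. It assumes the have data set from the question and consecutive calendar years within a window; the names windows, est, end_year and the y prefix are made up for illustration.

```sas
%let window = 2;  /* window length in years */

/* Pair every (ID, end year) with the observations inside its window, */
/* keeping only windows that hold exactly &window. non-missing lags.  */
/* SAS remerges the COUNT(*) back onto the detail rows here.          */
proc sql;
  create table windows as
  select a.id, a.year as end_year, b.year, b.earn, b.lag_earn
  from have as a
       inner join have as b
       on a.id = b.id
       and b.year between a.year - &window. + 1 and a.year
  where b.lag_earn is not missing
  group by a.id, a.year
  having count(*) = &window.
  order by a.id, a.year, b.year;
quit;

/* One regression per ID and window; OUTEST= stores the coefficients */
proc reg data=windows outest=est noprint;
  by id end_year;
  model earn = lag_earn;
run;
quit;

/* Firms as rows, window end years as columns (y1962, y1980, ...) */
proc transpose data=est out=want(drop=_name_) prefix=y;
  by id;
  id end_year;
  var lag_earn;
run;
```

With a window of 2 each fit passes exactly through its two points, so larger windows are probably more meaningful. The year columns appear in the order they are first encountered, not necessarily chronologically.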
As the title suggests, I experience some unexpected performance behaviour while working with a datastep.
A. The following code executes in 0.01 sec. So far so good.
data policen_roh;
set dwhprod.tbwh_kdu_detail_hi(
keep=
kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
where=(
police_nr=406045267
and record_typ='P'
)
)
;
run;
B. Additionally I have to filter a date, which is stored in a date-id, starting at 1 for 01/01/1850. Since I created formats to convert the date-id to a year (integer), I added the line input(put(kdu_dt_id, tag_id2jahr.),best.) ge 2017.
Works as expected. No problem here. I get my 15 expected records, and execution time increases marginally to 0.02 sec:
data policen_roh;
set dwhprod.tbwh_kdu_detail_hi(
keep=
kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
where=(
police_nr=406045267
and input(put(kdu_dt_id, tag_id2jahr.),best.) ge 2017
and record_typ='P'
)
)
;
run;
C. Now here is the problem: In an effort to speed up my code for larger datasets, I replaced
input(put(kdu_dt_id, tag_id2jahr.),best.) ge 2017
with
kdu_dt_id gt 60997 - the equivalent of 01/01/2017.
To my understanding, this should be way faster, since there is no put/input calculation required. However, while this returns the same result as B., execution time increases to roughly 30.00 seconds.
What did I miss?
Appendix: Log for further reference
1 The SAS System 13:56 Wednesday, February 7, 2018
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Programm';
4 %LET _CLIENTPROJECTPATH='R:\Projekte\20180125 Erneuerungsprovisionen\Erneuerungsprovisionen.egp';
5 %LET _CLIENTPROJECTNAME='Erneuerungsprovisionen.egp';
6 %LET _SASPROGRAMFILE=;
7
8 ODS _ALL_ CLOSE;
9 OPTIONS DEV=ACTIVEX;
10 GOPTIONS XPIXELS=0 YPIXELS=0;
11 FILENAME EGHTML TEMP;
12 ODS HTML(ID=EGHTML) FILE=EGHTML
13 ENCODING='utf-8'
14 STYLE=HtmlBlue
15 STYLESHEET=(URL="file:///C:/Program%20Files%20(x86)/SASHOME/x86/SASEnterpriseGuide/7.1/Styles/HtmlBlue.css")
16 ATTRIBUTES=("CODEBASE"="http://www2.sas.com/codebase/graph/v94/sasgraph.exe#version=9,4")
17 NOGTITLE
18 NOGFOOTNOTE
19 GPATH=&sasworklocation
20 ;
NOTE: Writing HTML(EGHTML) Body file: EGHTML
21
22 GOPTIONS ACCESSIBLE;
23 data policen_roh;
24 set dwhprod.tbwh_kdu_detail_hi(
25 keep=
26 kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
27 where=(
28 police_nr=406045267
29 and kdu_dt_id gt 60997
30 and record_typ='P'
31 )
32 )
33 ;
34 run;
NOTE: There were 14 observations read from the data set DWHPROD.TBWH_KDU_DETAIL_HI.
WHERE (police_nr=406045267) and (kdu_dt_id>60997) and (record_typ='P');
NOTE: The data set WORK.POLICEN_ROH has 14 observations and 10 variables.
NOTE: Compressing data set WORK.POLICEN_ROH increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
real time 1:10.44
cpu time 0.03 seconds
35
36 GOPTIONS NOACCESSIBLE;
37 %LET _CLIENTTASKLABEL=;
38 %LET _CLIENTPROJECTPATH=;
39 %LET _CLIENTPROJECTNAME=;
40 %LET _SASPROGRAMFILE=;
41
42 ;*';*";*/;quit;run;
43 ODS _ALL_ CLOSE;
44
45
46 QUIT; RUN;
2 The SAS System 13:56 Wednesday, February 7, 2018
47
My guess is that your underlying table is on a database, and the removal of the input / put functions changed the query execution such that it is no longer making use of available indexes.
A bit counterintuitive, but try removing kdu_dt_id gt 60997 from your where clause and put it in a gating if statement instead.
data policen_roh;
set dwhprod.tbwh_kdu_detail_hi(
keep=
kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
where=(
police_nr=406045267
and record_typ='P'
)
)
;
if kdu_dt_id gt 60997;
run;
Alternatively, speak to your DBA about tuning your database if this is a query you will run often.
For more information, you could rewrite your query using proc sql with the _method and _tree options, which print the chosen execution methods and the query tree to the log. You could then use that output to tune the query.
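A minimal sketch of such a proc sql rewrite, reusing the table, columns and filters from the question, might be:

```sas
/* _METHOD and _TREE write the execution methods and the query tree to the log */
proc sql _method _tree;
  create table policen_roh as
  select kdu_dt_id, police_nr, record_typ, kdnr, bag, betrag_akt,
         ursp_beginn_dt, beginn_dt, ablauf_dt, storno_dt
  from dwhprod.tbwh_kdu_detail_hi
  where police_nr = 406045267
    and record_typ = 'P'
    and kdu_dt_id gt 60997;
quit;
```

Comparing the log output with and without the kdu_dt_id condition should show whether the plan changes.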
Also, check out options sastrace=',,,d' sastraceloc=saslog; to show your dba more info on what is being sent to the database in terms of the underlying SQL query.
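A side note on the cutoff itself: assuming the date-id simply counts days with 1 = 01/01/1850, as described in the question, a quick check suggests that 60997 is 01/01/2017 itself. In that case gt 60997 would drop records dated exactly that day (the log above shows 14 observations, against the 15 reported for variant B), and ge 60997 may be the comparison that matches the year filter.

```sas
/* Convert 01/01/2017 to the date-id scale (assumption: id 1 = 01JAN1850, one id per day) */
data _null_;
  id_2017 = '01JAN2017'd - '01JAN1850'd + 1;
  put id_2017=;  /* writes id_2017=60997 to the log */
run;
```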
I am trying to construct a centered moving average in SAS.
My table is below:
date number average
01/01/2015 18 ...
01/01/2015 15 ...
01/01/2015 5 ...
02/01/2015 66 ...
02/01/2015 7 ...
03/01/2015 7 ...
04/01/2015 19 ...
04/01/2015 7 ...
04/01/2015 11 ...
04/01/2015 17 ...
05/01/2015 3 ...
06/01/2015 7 ...
... ... ...
I need to obtain the average number over a surrounding period of (-2,+2) days, rather than (-2,+2) observations.
I know that for a centered moving average I can use:
convert number=av_number/transformout=(cmovave 3)
but here we have a different number of observations on each day.
Can anyone tell me how to include only (-2,+2) days in the centered moving average in this case?
Thanks in advance!
The suggestion from @Joe to aggregate to a daily level is the right approach. However, you have to be careful not to lose the number of entries per day, otherwise you won't calculate the correct moving average. In other words, you need to weight the daily value by the number of entries for that day.
I've taken 3 steps to calculate the moving average; it may be possible to do it in 2, but I can't see how.
Step 1 is to calculate the sum and count of number per day.
Step 2 is to calculate the moving 5 day sum for both variables.
Step 3 then divides the sum by the count to get the weighted 5 day average.
I've added the trim option to exclude the first and last 2 records; obviously you can include those if you wish. You'll probably want to drop some of the extra variables as well.
data have;
input date :ddmmyy10. number;
format date date9.;
datalines;
01/01/2015 18
01/01/2015 15
01/01/2015 5
02/01/2015 66
02/01/2015 7
03/01/2015 7
04/01/2015 19
04/01/2015 7
04/01/2015 11
04/01/2015 17
05/01/2015 3
06/01/2015 7
;
run;
proc summary data=have nway;
class date;
var number;
output out=daily_agg sum=;
run;
proc expand data=daily_agg out=daily_agg_mov_sum;
convert number=tot_number / transformout = (cmovsum 5 trim 2);
convert _freq_=tot_count / transformout = (cmovsum 5 trim 2);
run;
data want;
set daily_agg_mov_sum;
if not missing(tot_number) then av_number = tot_number / tot_count;
run;
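As a cross-check, or if some calendar days have no observations at all, a different technique is a proc sql range join on the date. This is only a sketch: it assumes the have data set above, and the names want_by_date and n_obs are placeholders. Unlike the trim option, it also returns (narrower) windows at the start and end of the series.

```sas
proc sql;
  create table want_by_date as
  select d.date format=date9.,
         mean(h.number)  as av_number,  /* average over all rows within +/- 2 days */
         count(h.number) as n_obs       /* number of rows in the window */
  from (select distinct date from have) as d
       left join have as h
       on h.date between d.date - 2 and d.date + 2
  group by d.date
  order by d.date;
quit;
```

Because every row in the window enters the mean directly, days with more entries are weighted more heavily without a separate count step.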
This is my input dataset:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03 Col_04 Col_bb
NYC 10 0 44 55 66 34 44
CHG 90 55 4 33 22 34 23
TAR 10 8 0 25 65 88 22
I need to calculate the % of Col_A0 for a specific reference.
For example % col_A0 would be calculated as
10/(10+0+44+55+66+34+44)=.0395 i.e. 3.95%
So my output should be
Ref %Col_A0 %Rest
NYC 3.95% 96.05%
CHG 34.48% 65.52%
TAR 4.59% 95.41%
I can do this part but the issue is column variables.
Col_A0 and Ref are fixed columns, so they will be in the input every time. The other columns may not be. There can also be additional columns, such as Col_10, Col_11 up to Col_30, and Col_cc up to Col_zz.
For example the input data set in some scenarios can be just:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03
NYC 10 0 44 55 66
CHG 90 55 4 33 22
TAR 10 8 0 25 65
So is there a way to write SAS code that checks whether a column exists? Or is there a better way to do it?
This is my current SAS code written in Enterprise Guide.
PROC SQL;
CREATE TABLE output123 AS
select
ref,
(col_A0/(Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb)) FORMAT=PERCENT8.2 AS PERCNT_ColA0,
(1-(col_A0/(Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb))) FORMAT=PERCENT8.2 AS PERCNT_Rest
From Input123;
quit;
In scenarios where not all of the columns are present I get an error, and if there are additional columns I miss those. Please advise.
Thanks
I would not use SQL, but would use a regular data step.
data want;
set have;
a0_prop = col_a0/sum(of _numeric_);
run;
If you wanted to do this in SQL, the easiest way is to keep (or transform) the dataset in vertical format, i.e., each variable as a separate row per ID. Then you don't need to know how many variables there are to figure it out.
If you always want to sum all the numeric columns then just do:
col_A0 / sum(of _numeric_)
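Putting that together, a minimal data step sketch that produces both percentages regardless of which Col_ columns are present could look like this. It assumes the input data set is called have and that every numeric column should be part of the total; the names total, pct_col_a0 and pct_rest are placeholders.

```sas
data want;
  set have;
  /* Sum every numeric column that exists in this input.                     */
  /* TOTAL itself is still missing at this point and SUM ignores missing values. */
  total = sum(of _numeric_);
  pct_col_a0 = col_a0 / total;
  pct_rest   = 1 - pct_col_a0;
  format pct_col_a0 pct_rest percent8.2;
  keep ref pct_col_a0 pct_rest;
run;
```

Computing the total once avoids calling sum(of _numeric_) again after new numeric variables have been assigned, which would otherwise inflate the denominator.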
I am wondering about the best way to transpose data in SAS when I have multiple occurrences of my id variable. I know I can use the let option in the proc transpose statement to do this, but I do not want to get rid of any data, as I intend to compute averages.
Here is an example of my data and my code:
data grades;
input student testnum grade;
cards;
1 1 30
1 1 25
1 2 45
1 3 67
2 1 22
2 2 63
2 2 12
2 2 77
3 1 22
3 1 17
3 2 14
3 4 17
;
run;
proc sort data=grades;
by student testnum;
run;
proc transpose data=grades out=trgrades;
by student;
id testnum;
var grade;
run;
Here is how I would like my resulting dataset to look:
student testnum1 testnum2 testnum3 testnum4 avg12 avg34
1 30 45 67 . 33.33 67
1 25 . . . 33.33 67
2 22 63 . . 43.5 .
2 . 12 . . 43.5 .
2 . 77 . . 43.5 .
3 22 14 . 17 17.67 17
3 17 . . . 17.67 17
I want to use this new dataset (not sure how yet) to create the new columns that are the average score of all testnum1's and testnum2's for a student (avg12) and the average of all testnum3's and testnum4's (avg34) for a student.
There may be a much more efficient way to do this but I am stumped.
Any advice is appreciated.
If all you really need is the average of all test 1's and 2's, and 3's and 4's for each student, then you don't need to transpose at all. All you need is a simple data step:
data grouped;
set grades;
if testnum In (1,2) then group=1;
else if testnum in (3,4) then group=2;
run;
Then a basic proc means:
proc means data=grouped;
by student group;
var grade;
output out=averages mean=groupaverage;
run;
If you need the averages in a single observation, you can easily transpose the averages dataset.
proc transpose data=averages out=trgrades prefix=group;
by student;
id group;
var groupaverage;
run;
Update:
As mentioned by @Keith, using a format to group the tests is an excellent choice as well. Skip the data step and create the format like so:
proc format;
value TestGroup
1,2 = 'Tests 1 and 2'
3,4 = 'Tests 3 and 4'
;
run;
Then the proc means becomes:
proc means data=grades;
by student testnum;
var grade;
format testnum TestGroup.;
output out=averages mean=groupaverage;
run;
End Update
If, for some reason, you really need to have all the test scores in one observation then I would recommend using a data step to make them uniquely identifiable. Use by, first.testnum, retain, and a simple counter to assign each score a retake number. Now your transpose uses retake and testnum as id variables. You should be able to figure it out from there.
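In case it helps, here is a rough sketch of that retake-number approach. It assumes the grades data set from the question; the names grades_sorted, numbered, wide and the test prefix are placeholders, and a sum statement stands in for an explicit retain.

```sas
proc sort data=grades out=grades_sorted;
  by student testnum;
run;

/* Number repeated scores within each student/test: 1, 2, 3, ... */
data numbered;
  set grades_sorted;
  by student testnum;
  if first.testnum then retake = 0;
  retake + 1;  /* sum statement: RETAKE is retained across rows */
run;

/* One row per student; columns such as test1_1, test1_2, test2_1 */
proc transpose data=numbered out=wide(drop=_name_) prefix=test delimiter=_;
  by student;
  id testnum retake;
  var grade;
run;
```

Each column name combines the test number and the retake number, so no scores are dropped the way they would be with the let option.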
Really hoping right now that I didn't just do your SAS homework assignment for you.