I am new to SAS and wondered how to most efficiently list the months and years that fall between a starting date and ending date, in addition to the starting and ending date themselves. I've read about the INTCK and INTNX functions, the EXPAND function for time series data, and even CALENDAR FILL, but I'm not sure how to use them for this particular purpose. This task is easy to accomplish manually with a small dataset in Excel thanks to the drag-down autofill feature, but I need to find a way to do this in SAS due to the size of the dataset. Any suggestions would be greatly appreciated. Thank you!
The dataset is in a large text file organized like this now:
ID Start End
1000 08/01/2012 12/31/2012
1001 07/01/2010 05/31/2011
1002 04/01/1990 10/31/1991
But the output should look like this in the end:
ID MonthYear
1000 08/12
1000 09/12
1000 10/12
1000 11/12
1000 12/12
1001 07/10
1001 08/10
1001 09/10
1001 10/10
1001 11/10
1001 12/10
1001 01/11
1001 02/11
1001 03/11
1001 04/11
1001 05/11
1002 04/90
1002 05/90
1002 06/90
1002 07/90
1002 08/90
1002 09/90
1002 10/90
1002 11/90
1002 12/90
1002 01/91
1002 02/91
1002 03/91
1002 04/91
1002 05/91
1002 06/91
1002 07/91
1002 08/91
1002 09/91
1002 10/91
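data want2;
  set have;
  /* one row per month: i = 0 is the start month, the last i is the end month */
  do i = 0 to intck('month',start,end);
    monthyear=intnx('month',start,i,'b');
    output;
  end;
  format monthyear mmyys5.; /* prints as 08/12, matching the requested layout */
  keep id monthyear;
run;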
This will do the trick. PROC EXPAND may be more efficient, though I think it requires a number of desired observations rather than a start/end combination (though you could get that, I suppose).
data have;
informat start end MMDDYY10.;
input ID Start End;
datalines;
1000 08/01/2012 12/31/2012
1001 07/01/2010 05/31/2011
1002 04/01/1990 10/31/1991
;;;;
run;
data want;
  set have;
  format monthyear MMYYS5.; *formats the numeric monthyear variable with your desired format;
  monthyear=start; *start with the initial observation;
  output; *output it;
  do while (intck('month',monthyear,end) > 0); *iterate while month boundaries remain before end, so multi-year ranges are not cut short;
    monthyear = intnx('month',monthyear,1,'b'); *go to the next start of month;
    output; *output it;
  end;
  keep id monthyear;
run;
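As a quick sanity check on the two functions used above, here is a minimal sketch that is not part of either answer. The variable names (start_dt, end_dt, n_boundaries, last_month) are made up for illustration, and the literal dates are simply the first row of the sample data.

```sas
data _null_;
  start_dt = '01AUG2012'd;
  end_dt   = '31DEC2012'd;
  /* INTCK counts the month boundaries crossed between the two dates */
  n_boundaries = intck('month', start_dt, end_dt);
  /* INTNX moves forward that many months and aligns to the beginning ('b') */
  last_month = intnx('month', start_dt, n_boundaries, 'b');
  put n_boundaries= last_month= mmyys5.;
run;
```

For these dates the log should show four boundaries (1 September through 1 December) and a last month of 12/12, which is why a loop from 0 to the INTCK result yields five rows for ID 1000.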
I am trying to derive ADT from DAT: if a SUBJID has two consecutive missing DAT values, ADT should be the last non-missing DAT before that gap; otherwise ADT should be the latest DAT.
The code below produces the data I have. I want ADT derived according to the rule illustrated below (either merged back into HAVE or created in an entirely new dataset):
for subjid 1001: 1997-05-01, as this subject has no consecutive missing values (only a single, non-consecutive missing value)
for subjid 1002: 1998-02-01, as this subject has consecutive missing values at AVISIT 2-5
for subjid 1003: 1999-03-08, as the first consecutive missing values start at AVISIT 4, and DAT is non-missing at AVISIT 3.
Hope you can help me. Thanks.
data have;
infile datalines truncover;
input subjid avisit dat : yymmdd10.;
format dat yymmdd10.;
datalines;
1001 0 1997-01-01
1001 1 1997-02-01
1001 2
1001 3 1997-05-01
1002 0 1998-01-01
1002 1 1998-02-01
1002 2
1002 3
1002 4
1002 5
1002 6 1998-12-01
1003 0 1999-01-01
1003 1 1999-02-01
1003 2
1003 3 1999-03-08
1003 4
1003 5
1003 6 1999-05-01
1003 7
1003 8
;
run;
The code below creates a dataset that includes the last non-missing dat whenever the count of consecutive missing dat values is greater than or equal to 2. Each time we encounter a missing dat, we increase the consecutive-missing counter nmiss by 1.
We are always storing the last valid value of dat in the variables last_nonmissing_dat and last_nonmissing_avisit such that they always carry forward when a missing value is encountered. When two consecutive missing values occur, we output the results.
nmiss and last_nonmissing are reset whenever we move to a new subjid.
data want;
set have;
by subjid;
retain last_nonmissing_dat
last_nonmissing_avisit
;
if(first.subjid) then call missing(nmiss, of last_nonmissing:);
if(missing(dat)) then nmiss+1;
else do;
nmiss = 0;
last_nonmissing_dat = dat;
last_nonmissing_avisit = avisit;
end;
if(nmiss GE 2) then output;
format last_nonmissing_dat yymmdd10.;
run;
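The step above writes a row every time the counter is at 2 or more, and it writes nothing for a subject such as 1001 that never has two consecutive missing values. If one row per subject is wanted, a variation along the following lines might work. It is only a sketch: the names adt, last_dat, found and the output dataset adt_per_subject are assumptions, and it relies on HAVE being sorted by subjid and avisit.

```sas
data adt_per_subject;
  set have;
  by subjid;
  retain adt last_dat nmiss found;
  format adt yymmdd10.;
  if first.subjid then do;
    call missing(adt, last_dat);
    nmiss = 0;
    found = 0;
  end;
  if missing(dat) then nmiss + 1;
  else do;
    nmiss = 0;
    last_dat = dat;
  end;
  /* freeze ADT at the first run of two consecutive missing values */
  if nmiss = 2 and not found then do;
    adt = last_dat;
    found = 1;
  end;
  /* no such run in this subject: fall back to the latest non-missing DAT */
  if last.subjid then do;
    if not found then adt = last_dat;
    output;
  end;
  keep subjid adt;
run;
```

On the sample data this should give 1997-05-01 for 1001, 1998-02-01 for 1002 and 1999-03-08 for 1003, matching the rule stated in the question.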
I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;
I now want to estimate the following model for each ID:
proc reg;
  by ID;
  model EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.
Try this macro. This will only output if there are non-missing values of lags that you specify.
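%macro lag(data=, out=, window=);
data _want_;
  set &data.;
  by ID;
  /* count rows within each ID so a lag never reaches into the previous ID */
  if first.ID then _n = 0;
  _n + 1;
  LAG_EARN = lag&window.(earn);
  if _n <= &window. then call missing(lag_earn);
  if(NOT missing(lag_earn));
  drop _n;
run;
/* sort by year first so the transposed columns come out in year order */
proc sort data=_want_;
  by year id;
run;
/* NOTSORTED treats each consecutive run of an ID as its own group, so an ID whose years interleave with those of another ID produces several rows */
proc transpose data=_want_
  out=&out.(drop=_NAME_);
  by ID notsorted;
  id year;
  var lag_earn;
run;
proc sort data=&out.;
  by id;
run;
%mend;
%lag(data=have, out=want, window=1);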
I need to delete duplicates from a data set. My issue is that once I sort the data and flag the duplicates (using lag function), some information across variables is present within the duplicate observation and some within the original observation. I need to retain information across all variables while also deleting the duplicates.
My thought was to first fill in all the information between both the original and duplicate before deleting the duplicate.
Example of observations after sorting data and flagging duplicates (fake data values):
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
What I want:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
AB 36 1980 45654 . 2135 1
ON 26 1990 . 35464 8868 0
ON 26 1990 . 35464 8868 1
So I can delete duplicates and eventually have this:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
ON 26 1990 . 35464 8868 0
I created lag and lead variables to attempt to fill in information but it only seems to be working on some of the data set.
Here is the code for the lead variables:
data uncleaned_data;
merge uncleaned_data
uncleaned_data(
firstobs=2
keep= TRANS_ID MORB_ID Varx
rename=(TRANS_ID=lead_TRANS_ID MORB_ID=lead_MORB_ID Varx=lead_Varx ));
if lag(flag_duplicate=1) then do;
if TRANS_ID=. then do;
TRANS_ID= lead_TRANS_ID;
end;
if MORB_ID=. then do;
MORB_ID= lead_MORB_ID;
end;
if Varx=. then do;
Varx= lead_Varx;
end;
end;
run;
I did the same kind of thing for lag variables except my initial if statement is 'if flag_duplicate=1 then do;'
This method does not seem to work for many duplicate pairs in my data set.
Is there a better way to approach my problem overall, possibly through PROC SQL?
Thanks for reading and any advice offered!
I'm assuming that you don't have different values of Trans_id, for example, for the same Province. If that is the case, you can flatten the original data in one go using an update statement with a by statement. In my code, the first reference to the dataset, with obs=0, just creates the variables; the second reference populates the values, and because update ignores missing values in the second dataset, each non-missing value is carried forward. The by statement ensures that only one row is output per Province.
Using this method means you don't need to identify the duplicate values beforehand.
data have;
input Province $ AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate;
datalines;
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
;
run;
data want;
update have(obs=0) have;
by province;
run;
Something like this should work...
proc sort data=uncleaned_data; by Province AGE BRTHYEAR; run;
data cleaned_data (DROP=TRANS_ID RENAME=(KEEP_TRANS_ID=TRANS_ID) ...);
  set uncleaned_data;
  by Province AGE BRTHYEAR;
  retain keep_TRANS_ID ...; *without retain the saved values are reset on every row;
  if first.BRTHYEAR then do;
    keep_TRANS_ID=TRANS_ID;
    ...
  end;
  else do;
    if keep_TRANS_ID=. then keep_TRANS_ID=TRANS_ID;
    ...
  end;
  if last.BRTHYEAR then output;
run;
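Since the question asks about PROC SQL, here is one possible sketch of the same flattening. It rests on the same assumption as the update approach (at most one distinct non-missing value per variable within each Province/AGE/BRTHYEAR group), it reads the have dataset from the first answer, and the output name want_sql is made up for illustration. MAX ignores missing values, so it picks up whichever row holds the value.

```sas
proc sql;
  create table want_sql as
  select Province, AGE, BRTHYEAR,
         max(Trans_id) as Trans_id,  /* non-missing value in the group, if any */
         max(Morb_id)  as Morb_id,
         max(VarX)     as VarX
  from have
  group by Province, AGE, BRTHYEAR;
quit;
```

This drops flag_duplicate, which is no longer needed once each group has collapsed to a single row.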
I've tried to Google and read around this problem, but I can't seem to find an adequate solution. I'm hoping someone here can help me. I'm sorry if it's too simple but I would appreciate any advice or help.
I'm working with a longitudinal dataset and I would like to assign an encounter number for each person (ID) who may have had one or more interactions with our laboratory (accession). The dataset looks something like this, and I would like to create a new variable (encounter) that numbers each unique encounter for each individual sequentially.
ID accession encounter
----------------------------------
1 1234 1
1 1234 1
1 1235 2
1 1236 3
1 1236 3
2 1000 1
2 1001 2
2 1001 2
3 1111 1
3 1112 2
4 1001 1
4 1001 1
I've tried using first.variable statements such as:
data new; set old;
by id accession;
if first.id & first.accession then encounter=1;
else encounter+1;
run;
I haven't been successful because it won't retain the same encounter number if both the id and accession number remain the same.
Thank you in advance for helping to point me in the right direction.
You're close. At the first row of each ID you want to reset the counter to 0, and at the first row of each accession you want to increment it.
data new; set old;
by id accession;
Retain encounter;
if first.id then encounter=0;
If first.accession then encounter+1;
run;
I have the following dataset with three variables: account, balance, and time.
account balance time
1 110 01/2006
1 111 02/2006
1 88 03/2006
1 61 04/2006
1 1203 05/2006
2 112 01/2006
2 111 02/2006
2 665 03/2006
2 61 04/2006
2 1243 05/2006
3 110 01/2006
3 111 02/2006
3 88 03/2006
3 61 04/2006
3 1203 05/2006
Each account has more records than shown, so the starting time may be earlier and the ending time later than what I wrote.
My question is:
I am trying to find the maximum balance for each account over the previous 12 months. For example, for account 3 at 05/2006, I am trying to find max(account 3 balance at 04/2006, account 3 balance at 03/2006, account 3 balance at 02/2006, ..., account 3 balance at 05/2005).
What would you suggest? I used the LAG function with an array, but that is not efficient, because I will be in trouble if I need the previous 120 months.
Thank you.
Best
Xintong
Try this:
PROC SQL;
create table max_balance as
select *
from your_table
group by account
having balance=max(balance)
;
QUIT;
options mprint;
%macro lag;
data temp (drop=count i);
  set balance;
  by account time; /* the question names this variable account */
  array x(*) lag_bal1 - lag_bal3;
  %do i=1 %to 3;
    lag_bal&i=lag&i.(balance);
  %end;
  /* blank out the lags that would reach back into the previous account */
  if first.account then count=1;
  do i=count to dim(x);
    call missing(x(i));
  end;
  count+1;
  max_ball=max(balance, of lag_bal1-lag_bal3);
run;
%mend;
%lag;
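For long windows such as 120 months, generating one LAG call per month becomes unwieldy. One alternative, offered only as a sketch, keeps the previous balances in a temporary array used as a circular buffer. The assumptions are: the input dataset is called balance, as in the macro above; it is sorted by account and then chronologically; there is exactly one row per month with no gaps; and the window is at least 2. The names window, buf, n, max_prev and rolling_max are hypothetical.

```sas
%let window = 12;
data rolling_max;
  set balance;
  by account;
  /* temporary arrays are retained across rows automatically */
  array buf[&window.] _temporary_;
  if first.account then do;
    call missing(of buf[*]);
    n = 0;
  end;
  /* max over up to &window. previous rows; missing on the first row of an account */
  max_prev = max(of buf[*]);
  /* store the current balance, overwriting the oldest slot */
  n + 1;
  buf[mod(n - 1, &window.) + 1] = balance;
  drop n;
run;
```

Changing the window to 120 only requires editing the %let line. If months can be missing, the buffer would need to be indexed by calendar month rather than by row count, which this sketch does not attempt.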