Suppose I have a table:
id date N
1 04FEB2017 1
1 04FEB2017 .
1 04FEB2017 2
1 04FEB2017 .
1 05FEB2017 3
2 04FEB2017 4
2 04FEB2017 5
It is sorted by id, then by date.
Within each id and date group, among the rows where N is not null, I want to keep only the first one; rows with missing N are kept. So the result is:
id date N
1 04FEB2017 1
1 04FEB2017 .
1 04FEB2017 .
1 05FEB2017 3
2 04FEB2017 4
I tried ranking, but that didn't lead me to any ideas.
Because the data set is sorted, you can use the BY statement and the FIRST. automatic variables to find the first record in each group.
data have;
informat id 4. date date9. n 4.;
format date date9.;
input id date n;
datalines;
1 04FEB2017 1
1 04FEB2017 .
1 04FEB2017 2
1 04FEB2017 .
1 05FEB2017 3
2 04FEB2017 4
2 04FEB2017 5
;
run;
data want;
set have;
by id date;
if first.date or missing(n) ;
run;
The second data step keeps only the first record within each id/date group, plus any record where n is missing. This is edited based on the comment.
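If the first row of an id/date group could itself have a missing N, the FIRST.date test alone would not single out the first non-missing N. Here is a hedged alternative sketch (my addition, not part of the original answer) that keeps every missing-N row but only the first non-missing N within each group:
data want2;
set have;
by id date;
retain _kept;                   /* becomes 1 once a non-missing N has been kept in the group */
if first.date then _kept = 0;
if missing(n) then output;      /* keep every row where N is missing */
else if not _kept then do;      /* keep only the first non-missing N */
   output;
   _kept = 1;
end;
drop _kept;
run;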
I want to count the distinct values of a variable grouped by MEMBER_ID and a rolling date range of 5 years. I have seen a similar post.
How to Count Distinct for SAS PROC SQL with Rolling Date Window?
When I change h2.DATE BETWEEN h.DATE - 180 AND h.DATE to h2.year BETWEEN h.year-5 AND h.year, should it give me the correct distinct count within the last 5 years? Thank you in advance.
data have;
input permno year Cand_ID$;
datalines;
1 2000 1
1 2001 2
1 2002 3
1 2003 1
1 2004 3
1 2005 1
2 2000 1
2 2001 3
2 2002 1
2 2003 2
2 2004 2
2 2005 2
2 2006 1
2 2007 1
3 2001 3
3 2002 3
3 2003 3
3 2004 1
3 2005 1
;
run;
Here's how you can do it with a data step. This assumes you have a record for every year; if you do not, fill the missing years in with zeros first.
Keep a rolling list of the last 5 years by using the LAG function. If we keep a rolling, sorted array of the last 5 years' values, we can count the distinct values on each row to get a rolling 5-year count.
In other words, we're going to create and count a list that looks like this:
permno year id1 id2 id3 id4 id5
1 2000 . . . . 1
1 2001 . . . 1 2
1 2002 . . 1 2 3
1 2003 . 1 1 2 3
Code:
data want;
set have;
by permno year;
array lagid[4] $;
array id[5] $;
id1 = cand_id;
lagid1 = lag1(cand_id);
lagid2 = lag2(cand_id);
lagid3 = lag3(cand_id);
lagid4 = lag4(cand_id);
/* Reset the counter for the first group */
if(first.permno) then n = 0;
/* Count the number of rows within a group */
n+1;
/* Save the last 5 years by using the lag function,
but do not get lags from previous groups
*/
do i = 1 to 4;
if(i < n) then id[i+1] = lagid[i];
end;
/* Sort the array of IDs into ascending order */
call sortc(of id:);
/* Count the number of distinct IDs in the array. Do not count
missing values.
*/
n_distinct = 1;
do i = 2 to dim(id);
if(id[i] > id[i-1] AND NOT missing(id[i-1]) ) then n_distinct+1;
end;
drop lag: n i;
run;
Output (with the id: variables shown rather than dropped):
permno year Cand_ID id1 id2 id3 id4 id5 n_distinct
1 2000 1 . . . . 1 1
1 2001 2 . . . 1 2 2
1 2002 3 . . 1 2 3 3
1 2003 1 . 1 1 2 3 3
1 2004 3 1 1 2 3 3 3
1 2005 1 1 1 2 3 3 3
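For comparison with the question's PROC SQL idea, here is a hedged sketch of that approach adapted to the year variable (my adaptation, not part of the answer above). Note that h.year-5 to h.year spans six calendar years inclusive, so h.year-4 is what matches a 5-year window:
proc sql;
create table want_sql as
select h.permno, h.year, count(distinct h2.Cand_ID) as n_distinct
from have h
     inner join have h2
     on h.permno = h2.permno
    and h2.year between h.year - 4 and h.year
group by h.permno, h.year
order by h.permno, h.year;
quit;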
I'd like to assign to an empty field a value based on the values of other entries. Here is my dataset:
data temp;
input ID date :$10. type typea :$10.;
datalines;
1 10/11/2006 1 a
2 10/12/2006 2 a
2 . 2 b
3 20/01/2007 5 p
4 . 1 r
5 11/09/2008 1 ca
5 . 1 cb
5 . 2 b
;
run;
My goal is the following: for every empty entry of the variable "date", assign it the date of the record that has the same ID and the same type but a different typea. If there is no other record meeting those criteria, leave the date field empty. So the output should be:
data temp;
input ID date :$10. type typea :$10.;
datalines;
1 10/11/2006 1 a
2 10/12/2006 2 a
2 10/12/2006 2 b
3 20/01/2007 5 p
4 . 1 r
5 11/09/2008 1 ca
5 11/09/2008 1 cb
5 . 2 b
;
run;
I tried something like this, based on another answer on SO (SAS: get the first value where a condition is verified by group), but it doesn't work:
proc sort data=temp;
by ID type typea;
run;
data temp;
set temp;
by ID type typea ;
if cat(first.ID, first.type, first.typea) then date_store=date;
if cat(ID eq ID and type ne type and typea eq typea) then do;
date_change_type1to2=date_store;
end;
run;
Do you have any hints? Thanks a lot!
You could use the UPDATE statement to carry forward the DATE values within a group.
data have;
input ID type typea :$10. date :yymmdd. ;
format date yymmdd10.;
datalines;
1 1 a 2006-11-10
2 2 a 2006-12-10
2 2 b .
3 5 p 2007-01-20
4 1 r .
5 1 ca 2008-09-11
5 1 cb .
5 2 b .
;
data want;
/* UPDATE does not overwrite a variable with a missing value, so within each
   BY group the last non-missing DATE is carried forward; have(obs=0) supplies
   an empty master data set, and the explicit OUTPUT writes out every row. */
update have(obs=0) have;
by id type;
output;
run;
If there are also missing values of TYPEA then those will also be carried forward. If you don't want that to happen you could re-read just those variables after the update.
data want;
update have(obs=0) have;
by id type;
/* re-read TYPEA from have so its original value (even if missing) is kept */
set have(keep=typea);
output;
run;
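Alternatively, here is a hedged sketch of the same carry-forward using RETAIN on the have data set above (my addition, not from the original answer; it assumes the data are sorted by ID and type and that the non-missing date comes first within each group):
data want_retain;
set have;
by id type;
retain _date;
if first.type then _date = .;           /* reset the carried value at each ID/type group */
if not missing(date) then _date = date; /* remember the last non-missing date */
else date = _date;                      /* fill missing dates from the carried value */
drop _date;
run;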
The goal is to add a new row whenever there is a gap in the date variable between two consecutive rows within an id group.
When a gap occurs, duplicate the earlier of the two rows; only the date should differ from that row, incremented by one day for each missing day.
Everything needs to be grouped by id, and I need to achieve this without using PROC EXPAND.
data sample;
input id date numeric_feature character_feature $;
informat date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-03 1 B
2 2020-01-05 9 F
;
data sample;
set sample;
format date yymmdd10.;
run;
The desired result:
data sample;
input id date numeric_feature character_feature $;
informat date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-03 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-02 4 V
2 2020-01-03 1 B
2 2020-01-04 1 B
2 2020-01-05 9 F
;
data sample;
set sample;
format date yymmdd10.;
run;
You can perform a 1:1 self merge, with the second copy of the data starting at row 2, in order to provide a lead value. A 1:1 merge does not use a BY statement.
Example:
data have;
input id date numeric_feature character_feature $;
informat date yymmdd10.;
format date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-03 1 B
2 2020-01-05 9 F
;
data want;
* 1:1 merge without by statement;
merge
have /* start at row 1 */
have ( firstobs=2 /* start at row 2 for lead values */
keep=id date /* more data set options that prepare the lead */
rename = ( id=nextid
date=nextdate
))
;
output;
flag = '*'; /* marker for filled in dates */
if id = nextid then
do date=date+1 to nextdate-1;
output;
end;
drop next:;
run;
The result flags the filled-in dates with flag = '*'.
To "look ahead" you can re-read the same dataset starting from the second observation. SAS will stop when you read past the end of the input so add an extra empty observation.
data sample;
input id date numeric_feature character_feature $;
informat date yymmdd.;
format date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-03 1 B
2 2020-01-05 9 F
;
data want;
set sample;
by id;
set sample(firstobs=2 keep=date rename=(date=next_date)) sample(obs=1 drop=_all_);
output;
if not last.id then do date=date+1 to next_date-1; output; end;
run;
Results:
numeric_ character_
Obs id date feature feature next_date
1 1 2020-01-01 5 A 2020-01-02
2 1 2020-01-02 3 Z 2020-01-04
3 1 2020-01-03 3 Z 2020-01-04
4 1 2020-01-04 2 D 2020-01-05
5 1 2020-01-05 7 B 2020-01-01
6 2 2020-01-01 4 V 2020-01-03
7 2 2020-01-02 4 V 2020-01-03
8 2 2020-01-03 1 B 2020-01-05
9 2 2020-01-04 1 B 2020-01-05
10 2 2020-01-05 9 F .
I've got group data with flags created any time a name is changed within that group. I can pull the last two or first two observations within a group, but I am struggling to figure out how to pull the last observation with a name change AND the row right after it.
The code below gives me the first or last two observations per group, depending on how I sort the data.
DATA LastTwo;
SET WhatIveGot;
count + 1;
BY group_ID /*data pre sorted*/;
IF FIRST.group_ID THEN count=1;
IF count<=2 THEN OUTPUT;
RUN;
What I need is the LAST observation with a name change AND the following row.
group_ID NAME DATE NAME_CHange
1 TOM 1/1/19 0
1 Jill 1/30/19 1
1 Jill 1/20/19 0
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 TOM 2/1/19 0
2 Jill 2/30/19 1
2 Jill 2/20/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
2 Jim 4/15/19 0
3 Joe 2/20/19 0
3 Kim 3/10/19 1
3 Kim 3/30/19 0
3 Ken 4/15/19 1
4 Tim 3/10/19 0
4 Tim 3/30/19 0
The desired output:
group_ID NAME DATE NAME_CHange
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
3 Ken 4/15/19 1
The cases for Group_ID 2 and 3 are the roadblock. The data is already sorted by date.
Thank you for any help in advance
Use DOW processing to determine where the last name change was. Apply that information in a succeeding loop.
Example:
data want;
  /* First pass over the group: find the row number of the last name change */
  do _n_ = 1 by 1 until (last.group_id);
    set have;
    by group_id name notsorted;
    /* first.name is also true on the first row of a group, which is not a change */
    if first.name and not first.group_id then _index_of_last_name_change = _n_;
  end;
  /* Second pass: output that row and the one immediately after it */
  do _n_ = 1 to _n_;
    set have;
    if not missing(_index_of_last_name_change) and
       _index_of_last_name_change <= _n_ <= _index_of_last_name_change+1 then OUTPUT;
  end;
  drop _:;
run;
I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have a family id, an individual id, an implicate number, and imputed income for each implicate.
What I need is to replicate the values of the first individual in each family (all five implicates) for the remaining individuals within the family, replacing whatever values those cells previously held, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm replicating only one variable, but in my project I will have to do this for dozens of variables.
So far, I have come up with this solution:
%let implist_1=imp_inc;
%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&imp_list1);
This works well for one or two variables. However, it is tremendously slow once you do it for 400 variables in a 1.5 GB data set.
I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but I'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
Yes, this can be done in a DATA step using the first. references made available via the BY statement.
data want;
set have (keep=famid indid implicate imp_inc /* other vars */);
by famid indid implicate; /* including implicate so the step logs a run-time error if the data are not sorted this way */
if first.famid then if indid ne 1 then abort;
array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */
if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;
The DATA step could be significantly faster because it makes a single pass through the data; however, there is an internal processing cost associated with calculating all those pesky [] array addresses at run time, and that cost could become noticeable at some <n>.
SQL has simpler syntax, is easier to understand, and works even if the have data set is unsorted or has some peculiar sequencing within the by groups.
This is fairly straightforward with a bit of SQL:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.