Calculate lags and update flags after comparing years - sas

Suppose I have the following:
ID yS yE flagS FlagE
0001 2015 2017 1 1
0001 2017 2020 2 2
0002 2017 2018 1 1
0002 2019 2020 2 2
I'm trying to generate the following:
ID yS yE flagS FlagE
0001 2015 2017 1 1
0001 2017 2020 2 1
0002 2017 2018 1 1
0002 2019 2020 2 2
meaning: flagE at the second row becomes equal to the flag at the first row (i.e., 1) because yE at the first row = 2017 is equal to yS at the second row, i.e., 2017.
Conversely, for ID=0002 nothing should be done because there is a gap between end and start in terms of years.
I tried applying the lag function to yE and then comparing it with yS, but the lag runs over the whole variable and carries values across IDs. Using first.ID did not work either.
Can anyone help me please?

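The following gives you the expected output:
1. Sort by id and descending yS.
2. Simulate a lead (look-ahead) function with the merge trick, which after the descending sort reads the chronologically previous row of the same id.
3. Apply the desired flag if the given condition is true.
4. Sort the output back by ascending id and yS to get the expected format.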
proc sort data=have out=stage1;
    by id descending ys;
run;

data stage2;
    /* one-to-one merge of the data with itself shifted by one row */
    merge stage1
          stage1(firstobs=2 keep=ye flage id
                 rename=(ye=_ye flage=_flage id=_id));
    /* ignore the look-ahead row when it belongs to another id */
    if id ne _id then
        call missing(_ye, _flage, _id);
    if _ye=ys then
        flage=_flage;
    drop _:;
run;

proc sort data=stage2 out=want;
    by id ys;
run;
want
id yS yE flagS flagE
0001 2015 2017 1 1
0001 2017 2020 2 1
0002 2017 2018 1 1
0002 2019 2020 2 2
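As a possible alternative to the double sort, the same comparison can be sketched in a single pass with RETAIN and BY-group processing. This assumes have is already sorted by id and ys; the helper names prev_ye, prev_flage and orig_flage are made up for illustration. Like the merge version, it compares against the previous row's original flagE rather than an already updated one.

```sas
data want;
    set have;                      /* assumption: sorted by id, ys */
    by id;
    retain prev_ye prev_flage;     /* values carried over from the previous row */
    if first.id then call missing(prev_ye, prev_flage);
    orig_flage = flage;            /* keep the unmodified flag for the next row */
    if not missing(prev_ye) and ys = prev_ye then flage = prev_flage;
    prev_ye = ye;
    prev_flage = orig_flage;
    drop prev_ye prev_flage orig_flage;
run;
```

Resetting the retained values at first.id is what keeps one ID's last row from leaking into the next ID, which is the problem the plain lag attempt ran into.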

Related

Summary table of three variables

Suppose I have the following:
ID HS REP YEAR
0001 A a 2015
0001 B a 2015
0001 B c 2015
0001 B d 2015
0002 A f 2015
0002 A g 2015
0002 B a 2015
...... .... ..... .....
I would like to get the count of "REP" per "HS" and also the count of distinct "IDs" within each "HS" (it does not matter if the same ID appears in more than one HS; it is then counted once in each). Desired output:
Year HS REP TotIDs
2015 A 3 2
2015 B 4 2
It means: HS "A" has 3 REPs and 2 IDs (overall, without distinguishing by REP). The same goes for "B".
I also need some summary statistics like the mean, median etc. Is there a way to do this with proc means or univariate or freq (maybe?) in one shot?
Thank you in advance
SQL is a little more versatile when you want distinct value counting within a group.
SQL does not have an aggregate function for mode statistic.
Assuming the grouping is by year and hs, as it appears to be, the following is an example:
data have; input
ID HS $ REP $ YEAR x;
id2=id;
datalines;
0001 A a 2015 12
0001 B a 2015 2
0001 B c 2015 3
0001 B d 2015 5
0002 A f 2015 13
0002 A g 2015 14
0002 B a 2015 6
;
proc sql;
    create table want as
    select
        year
      , hs
      , count(*) as rep
      , count(distinct id) as id_n_unq
      , sum (x) as x_sum
      , mean (x) as x_mean
      , median (x) as x_median
      /* , mode(x) as x_mode -- SQL does not have aggregate function for mode */
    from have
    group by year, hs
    ;
quit;
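If the mode is needed as well, PROC MEANS is one option for the plain statistics, since it accepts MODE as a statistic keyword. The sketch below assumes the have dataset from the example above with its analysis variable x; the output names stats, x_n, x_mean, x_median and x_mode are arbitrary. It does not produce the distinct ID count, so that part would still come from the SQL step.

```sas
proc means data=have noprint nway;
    class year hs;                 /* one output row per year/hs combination */
    var x;
    output out=stats(drop=_type_ _freq_)
           n=x_n mean=x_mean median=x_median mode=x_mode;
run;
```

NWAY keeps only the full year-by-hs combinations, so the result lines up row for row with the grouped SQL table and can be merged with it by year and hs.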

Sort variables based on another data set and append data

is there a way in SAS to order columns (variables) of a data set based on the order of another data set? The names are perfectly equal.
And is there also a way to append them (vertically) based on the same column names?
Thank you in advance
ID YEAR DAYS WORK DATASET
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
DATASET DAYS WORK ID YEAR
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
I just need to sort the second data set based on the first and append the second to the first.
Can anyone help me please?
This should work:
data have1;
input ID YEAR DAYS WORK DATASET;
format ID z4.;
datalines;
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
;
run;
data have2;
input DATASET DAYS WORK ID YEAR;
format ID z4.;
datalines;
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
;
run;
First we create a new table by copying our first table. Then we just insert the rows from the second table into it, naming the columns explicitly. No need to change the column order of the original second table.
proc sql;
create table want as
select *
from have1
;
insert into want(ID, YEAR, DAYS, WORK, DATASET)
select ID, YEAR, DAYS, WORK, DATASET
from have2
;
quit;
I have no idea how you could sort based on something that is not there.
But appending is trivial. You can just set them together.
data want;
set one two;
run;
And if both datasets are already sorted by some key variables (year perhaps in your example?) then you could interleave the observations instead. Just add a BY statement.
data want;
set one two;
by year;
run;
And if you want to make a new version of the second dataset with the variable order modified to match the variable order in the first dataset (something that really has nothing to do with sorting the data) you could use the OBS= dataset option. Code like this will order the variables based on the order they have in ONE but will not actually use any of the data from that dataset.
data want;
set one(obs=0) two;
run;
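Another way to append, sketched here with the have1 and have2 names from the question, is PROC APPEND. It matches variables by name rather than by position, so the differing column order in the second dataset does not matter. Because PROC APPEND adds rows to the BASE= dataset in place, the sketch copies have1 into want first so the original stays untouched.

```sas
/* copy the first dataset so HAVE1 itself is not modified */
data want;
    set have1;
run;

/* add the rows of HAVE2; the column order of WANT is kept */
proc append base=want data=have2;
run;
```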

Calculate the lag of years by groups of records and between consecutive dates

I would like to ask your help to do the following:
I have a data set that looks like this:
ID Date1 Date2
a 2015 2015
a 2016 2016
a 2017 2018
a 2018 2020
b 2015 2016
b 2018 2019
b 2020 2020
.... ..... .....
Desired output:
ID Lag
a Start
a 1
a 1
a 0
b Start
b 3
b 1
.... ..... .....
I need to count how many years pass from Date2 to Date1 next row for each ID. For example: from Date2 = 2015 (first row) to Date1 2016 (second row) there is 1 year of difference. Can anyone help me please? The first row in the desired output should be set to "Start" to indicate that the lag cannot be calculated because it is the starting point.
Thank you in advance
A simple lag() combined with first.ID does this.
data want ;
    set have ;
    by ID ;
    /* call lag() on every row so its queue stays in step with the data */
    l = lag(Date2) ;
    if not first.ID then diff = sum(Date1,-l) ;
run ;
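To also get the "Start" marker from the desired output, the result has to be a character variable, since a numeric variable cannot hold text. A minimal sketch, assuming have is sorted by ID and that Date1 and Date2 are numeric years; prev_date2 and lag_label are names chosen here for illustration:

```sas
data want;
    set have;
    by id;
    length lag_label $5;
    prev_date2 = lag(date2);       /* executed on every row, never conditionally */
    if first.id then lag_label = 'Start';
    else lag_label = cats(date1 - prev_date2);
    keep id lag_label;
run;
```

The lag() call sits outside the IF on purpose: lag() only updates its queue when it executes, so calling it conditionally would return the value from the last row on which it ran instead of the previous row.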

Find matches by condition between 2 datasets in SAS

I'm trying to improve the processing time of an existing for-loop in a *.jsl file that my classmates and I use in our programming course using SAS. My question: does SAS offer a PROC or a sequence of statements that can replicate a search-and-match condition? Or a way to go through unsorted files without going line by line looking for matching condition(s)?
Our current script file is below:
if( roadNumber_Fuel[n]==roadNumber_TO[m] &
fuelDate[n]>=tripStart[m] & fuelDate[n]<=TripEnd[m],
newtripID[n] = tripID[m];
);
I have 2 sets of data simplified below.
DATA1:
ID1 Date1
1 May 1, 2012
2 Jun 4, 2013
3 Aug 5, 2013
..
.
&
DATA2:
ID2 Date2 Date3 TRIP_ID
1 Jan 1 2012 Feb 1 2012 9876
2 Sep 5 2013 Nov 3 2013 931
1 Dec 1 2012 Dec 3 2012 236
3 Mar 9 2013 May 3 2013 390
2 Jun 1 2013 Jun 9 2013 811
1 Apr 1 2012 May 5 2012 76
...
..
.
I need to check a lot of iterations but my goal is to have the code
check:
Data1.ID1 = Data2.ID2 AND (Date1 >Date2 and Date1 < Date3)
My desired output dataset would be
ID1 Date1 TRIP_ID
1 May 1, 2012 76
2 Jun 4, 2013 811
Thanks for any insight!
You can do range matches in two ways. First off, you can match using PROC SQL if you're familiar with SQL:
proc sql;
    create table C as
    select *
    from A
    left join B
        on A.id=B.id and A.date > B.date1 and A.date < B.date2
    ;
quit;
Second, you can create a format. This is usually the faster option when it is possible. It is tricky when you have IDs, but you can do it.
First, create a new variable that combines ID and date. SAS dates are stored as day counts (around 18,000-20,000 for the dates in this example), so multiply your ID by 100,000 and you're safe.
Second, create a dataset from the range dataset where START=lower date plus id*100,000, END=higher date + id*100,000, FMTNAME=some string that will become the format name (must start with A-Z or _, contain only A-Z, _ and digits, and must not end in a digit). LABEL is the value you want to retrieve (Trip_ID in the above example).
data b_fmts;
set b;
start=id*100000+date1;
end =id*100000+date2;
label=value_you_want_out;
fmtname='MYDATEF';
run;
Then use PROC FORMAT with the CNTLIN= option to import the format.
proc format cntlin=b_fmts;
run;
Make sure your date ranges don't overlap - if they do this will fail.
Then you can use it easily:
data a_match;
set a;
trip_id=put(id*100000+date,MYDATEF.);
run;
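One thing to watch with the format approach: a value that falls in none of the ranges is not mapped, so put() just returns the number itself as text. A common way around this is to add an "other" record to the control dataset with HLO='O'. The sketch below is one possible version of the control dataset under the same assumed names as above (b, id, date1, date2), with trip_id as the value to retrieve and 'NONE' as an arbitrary placeholder label for non-matches.

```sas
data b_fmts;
    set b end=last;
    retain fmtname 'MYDATEF' type 'N';   /* numeric format */
    length label $12 hlo $1;
    start = id*100000 + date1;
    end   = id*100000 + date2;
    label = cats(trip_id);               /* label stored as text */
    hlo   = ' ';
    output;
    if last then do;                     /* extra record for everything else */
        call missing(start, end);
        hlo   = 'O';
        label = 'NONE';
        output;
    end;
    keep fmtname type start end label hlo;
run;
```

Note that format ranges include both endpoints by default, so this matches date >= date1 and date <= date2, as in the original script's condition, rather than the strict inequalities used in the SQL join.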