Creating a sequence variable for crossover studies in SAS datastep

I have the following dataset from a crossover design study with participant_id, treatment_arm, and date_of_treatment as follows:
participant_id  treatment_arm  date_of_treatment
1               A              Jan 1 2022
1               B              Jan 2 2022
1               C              Jan 3 2022
2               C              Jan 4 2022
2               B              Jan 5 2022
2               A              Jan 6 2022
So for participant_id 1, based on the order of the date_of_treatment, the sequence would be ABC. For participant_id 2, it would be CBA.
Based on the above, I want to create column seq as follows:
participant_id  treatment_arm  date_of_treatment  seq
1               A              Jan 1 2022         ABC
1               B              Jan 2 2022         ABC
1               C              Jan 3 2022         ABC
2               C              Jan 4 2022         CBA
2               B              Jan 5 2022         CBA
2               A              Jan 6 2022         CBA
How do I go about creating the column using the 3 variables participant_id, treatment_arm, and date_of_treatment in datastep?

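You could use a double DoW loop. The data must be sorted by participant_id and, within each participant, by date_of_treatment.
data want;
    /* First pass: build the sequence string for the whole participant */
    do until (last.participant_id);
        set have;
        length seq $3;
        by participant_id;
        seq = cats(seq, treatment_arm);
    end;
    /* Second pass: re-read the same rows and output each one with seq attached */
    do until (last.participant_id);
        set have;
        by participant_id;
        output;
    end;
run;
Remember to increase the length of seq if any participant has more than 3 treatments.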
participant_id treatment_arm date_of_treatment seq
1 A 01JAN2022 ABC
1 B 02JAN2022 ABC
1 C 03JAN2022 ABC
2 C 04JAN2022 CBA
2 B 05JAN2022 CBA
2 A 06JAN2022 CBA
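For readers who want to check the two-pass logic outside SAS, here is a minimal Python sketch of the same idea: one pass over each participant's rows builds the sequence string, and a second pass writes it onto every row. The function name `add_sequence` and the ISO date strings are assumptions made for this illustration; they are not part of the original answer.

```python
from itertools import groupby
from operator import itemgetter

# (participant_id, treatment_arm, date_of_treatment as an ISO string)
rows = [
    (1, "A", "2022-01-01"),
    (1, "B", "2022-01-02"),
    (1, "C", "2022-01-03"),
    (2, "C", "2022-01-04"),
    (2, "B", "2022-01-05"),
    (2, "A", "2022-01-06"),
]

def add_sequence(rows):
    """Attach the per-participant treatment sequence to every row."""
    # Sort by participant, then by date (ISO strings sort chronologically).
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    out = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Pass 1: concatenate the arms in date order.
        seq = "".join(arm for _, arm, _ in group)
        # Pass 2: emit every row of the group with the finished sequence.
        out.extend((pid, arm, day, seq) for pid, arm, day in group)
    return out

result = add_sequence(rows)
for row in result:
    print(row)
```

As in the DoW loop, the sequence is only complete after the whole group has been read, which is why each group is traversed twice.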

Retaining max values over multiple columns

I have a dataset like below, and want to collapse a subject so that I can see if they were diagnosed with a disease at all within the past 3 years using SAS. Disease1-3 are binary yes/no flags.
For example - for subject a in 2021, since they had all 3 diseases in the prior year of 2020, they should also have flags for all those diseases in 2021 and 2022.
subject  year  disease1  disease2  disease3
a        2020  1         1         1
a        2021  0         0         0
a        2022  0         0         0
b        2020  0         1         0
b        2021  1         0         0
b        2022  0         0         1
I'm hoping it would look something like this.
subject  year  disease1  disease2  disease3
a        2020  1         1         1
a        2021  1         1         1
a        2022  1         1         1
b        2020  0         1         0
b        2021  1         1         0
b        2022  1         1         1
What would be the best way to do this? I've tried using a do loop and the retain statement, but I get stuck because there are multiple columns to consider (disease1-disease3).
Store the max value of disease into a temporary variable. Retain this for each group. If the stored max value is ever 1, set all subsequent values to be 1 for each disease.
data want;
    set have;
    by subject year;
    array disease[*] disease1-disease3;
    array disease_max[3] _temporary_;
    retain disease_max;
    do i = 1 to dim(disease);
        if(first.subject) then disease_max[i] = 0; /* Reset disease max counter for each subject */
        if(disease[i] = 1) then disease_max[i] = 1; /* Store max disease value */
        if(disease_max[i] = 1) then disease[i] = 1; /* Set disease to 1 if disease_max is 1 */
    end;
    drop i;
run;
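Another option is the UPDATE statement, which does not let missing values in the transaction data set overwrite existing values. Recode 0 to missing, update by subject while outputting every row, then recode missing back to 0:
data have;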
input subject $ year disease1 disease2 disease3;
datalines;
a 2020 1 1 1
a 2021 0 0 0
a 2022 0 0 0
b 2020 0 1 0
b 2021 1 0 0
b 2022 0 0 1
;
data temp;
set have;
array d disease:;
do over d;
if d = 0 then d = .;
end;
run;
data want;
update temp(obs=0) temp;
by subject;
array d disease:;
do over d;
if d = . then d = 0;
end;
output;
run;
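Both answers amount to a running maximum per subject and per disease column. The Python sketch below is only a cross-check of that logic against the desired output in the question; the name `carry_forward_flags` is hypothetical. Like the SAS code, it carries a flag forward through all later years of the subject and does not enforce a 3-year cut-off.

```python
# (subject, year, disease1, disease2, disease3)
rows = [
    ("a", 2020, 1, 1, 1),
    ("a", 2021, 0, 0, 0),
    ("a", 2022, 0, 0, 0),
    ("b", 2020, 0, 1, 0),
    ("b", 2021, 1, 0, 0),
    ("b", 2022, 0, 0, 1),
]

def carry_forward_flags(rows):
    """Once a disease flag is 1 for a subject, keep it 1 in all later years."""
    running = {}  # subject -> running maximum of each flag so far
    out = []
    for subject, year, *flags in sorted(rows, key=lambda r: (r[0], r[1])):
        best = running.setdefault(subject, [0] * len(flags))
        for i, flag in enumerate(flags):
            best[i] = max(best[i], flag)
        out.append((subject, year, *best))
    return out

result = carry_forward_flags(rows)
for row in result:
    print(row)
```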

How to Count Distinct for SAS PROC SQL with Rolling Date Window of 5 years?

I want to count the distinct values of a variable grouped by MEMBER_ID and a rolling date range of 5 years. I have seen a similar post.
How to Count Distinct for SAS PROC SQL with Rolling Date Window?
When I change h2.DATE BETWEEN h.DATE - 180 AND h.DATE to h2.year BETWEEN h.year-5 AND h.year, should it give me the correct distinct count within the last 5 years? Thank you in advance.
data have;
input permno year Cand_ID$;
datalines;
1 2000 1
1 2001 2
1 2002 3
1 2003 1
1 2004 3
1 2005 1
2 2000 1
2 2001 3
2 2002 1
2 2003 2
2 2004 2
2 2005 2
2 2006 1
2 2007 1
3 2001 3
3 2002 3
3 2003 3
3 2004 1
3 2005 1
;
run;
Here's how you can do it with a data step. This assumes you have values for all years. If you do not, fill it in with zeros.
Use the lag function to keep a rolling list of the IDs from the last 5 years. Sorting that list on each row lets us count its distinct values, which gives a rolling 5-year count.
In other words, we're going to create and count a list that looks like this:
permno year id1 id2 id3 id4 id5
1 2000 . . . . 1
1 2001 . . . 1 2
1 2002 . . 1 2 3
1 2003 . 1 1 2 3
Code:
data want;
    set have;
    by permno year;
    array lagid[4] $;
    array id[5] $;
    id1 = cand_id;
    lagid1 = lag1(cand_id);
    lagid2 = lag2(cand_id);
    lagid3 = lag3(cand_id);
    lagid4 = lag4(cand_id);
    /* Reset the counter at the start of each group */
    if(first.permno) then n = 0;
    /* Count the number of rows within a group */
    n+1;
    /* Save the last 5 years by using the lag function,
       but do not get lags from previous groups
    */
    do i = 1 to 4;
        if(i < n) then id[i+1] = lagid[i];
    end;
    /* Sort the array of IDs into ascending order */
    call sortc(of id:);
    /* Count the number of distinct IDs in the array. Do not count
       missing values.
    */
    n_distinct = 1;
    do i = 2 to dim(id);
        if(id[i] > id[i-1] AND NOT missing(id[i-1]) ) then n_distinct+1;
    end;
    drop lag: n i;
run;
Output (without id: dropped):
permno year Cand_ID id1 id2 id3 id4 id5 n_distinct
1 2000 1 . . . . 1 1
1 2001 2 . . . 1 2 2
1 2002 3 . . 1 2 3 3
1 2003 1 . 1 1 2 3 3
1 2004 3 1 1 2 3 3 3
1 2005 1 1 1 2 3 3 3
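The rolling count can be cross-checked outside SAS with a fixed-size window per group. This Python sketch assumes, like the data step, that each permno has exactly one row per year and that the rows are already in year order; `rolling_distinct` is a name chosen for this illustration. Note that the window is the current row plus the 4 before it, which is 5 years, whereas `h.year-5 AND h.year` in the question spans 6 years inclusive.

```python
from collections import deque

# (permno, year, cand_id), sorted by permno and year
rows = [
    (1, 2000, "1"), (1, 2001, "2"), (1, 2002, "3"),
    (1, 2003, "1"), (1, 2004, "3"), (1, 2005, "1"),
    (2, 2000, "1"), (2, 2001, "3"), (2, 2002, "1"), (2, 2003, "2"),
    (2, 2004, "2"), (2, 2005, "2"), (2, 2006, "1"), (2, 2007, "1"),
    (3, 2001, "3"), (3, 2002, "3"), (3, 2003, "3"),
    (3, 2004, "1"), (3, 2005, "1"),
]

def rolling_distinct(rows, window=5):
    """Count distinct cand_id values over the last `window` rows of each permno."""
    recent = {}  # permno -> the most recent `window` IDs
    out = []
    for permno, year, cand_id in rows:
        ids = recent.setdefault(permno, deque(maxlen=window))
        ids.append(cand_id)  # the oldest ID drops out once the window is full
        out.append((permno, year, cand_id, len(set(ids))))
    return out

result = rolling_distinct(rows)
for row in result:
    print(row)
```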

Stata alternatives for lookup

I have a large Stata dataset that contains the following variables: year, state, household_id, individual_id, partner_id, and race. Here is an example of my data:
year state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
Note that, in the above table, the individuals in rows 1 and 2 are married to each other.
I want to create a variable that is one if the person is in an interracial marriage.
As a first step, I used the following code
by household_id year: gen inter=0 if race==race[partner_id]
replace inter=1 if inter==.
This code worked well but gave the wrong result in a few cases. As an alternative, I created a string variable identifying each user and its partner, using
gen id_user=string(household_id)+"."+string(individual_id)+string(year)
gen id_partner=string(household_id)+"."+string(partner_id)+string(year)
What I want to do now is something like what vlookup does in Excel: for each row, take the id_partner, find it in id_user, look up that person's race, and compare it with the race of the original user.
I guess it should be something like this?
gen inter2==1 if (find race[idpartner]) == (race[iduser])
The expected output should be like this
year state household_id individual_id partner_id race inter2
1980 CA 23 2 1 3 1
1980 CA 23 1 2 1 1
1990 NY 43 4 2 1 0
1990 NY 43 2 4 1 0
I don't think you need anything so general. As you realise, the information on identifiers suffices to find couples, and that in turn allows comparison of race for the people in each couple.
In the code below _N == 2 is meant to catch data errors, such as one partner but not the other being an observation in the dataset or repetitions of one partner or both.
clear
input year str2 state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
end
generate couple_id = cond(individual_id < partner_id, string(individual_id) + ///
" " + string(partner_id), string(partner_id) + ///
" " + string(individual_id))
bysort state year household_id couple_id : generate mixed = race[1] != race[2] if _N == 2
list, sepby(household_id) abbreviate(15)
     +-------------------------------------------------------------------------------------+
     | year   state   household_id   individual_id   partner_id   race   couple_id   mixed |
     |-------------------------------------------------------------------------------------|
  1. | 1980      CA             23               2            1      3         1 2       1 |
  2. | 1980      CA             23               1            2      1         1 2       1 |
     |-------------------------------------------------------------------------------------|
  3. | 1990      NY             43               4            2      1         2 4       0 |
  4. | 1990      NY             43               2            4      1         2 4       0 |
     +-------------------------------------------------------------------------------------+
This idea is documented in this article. The link gives free access to a pdf file.
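For comparison, the vlookup-style approach described in the question can be sketched as a keyed lookup. The Python below is not Stata and not part of the answer; it only illustrates the logic, and the name `mixed_flags` is hypothetical. A partner who is missing from the data yields `None`, which plays the same role as the `_N == 2` guard.

```python
# (year, state, household_id, individual_id, partner_id, race)
people = [
    (1980, "CA", 23, 2, 1, 3),
    (1980, "CA", 23, 1, 2, 1),
    (1990, "NY", 43, 4, 2, 1),
    (1990, "NY", 43, 2, 4, 1),
]

def mixed_flags(people):
    """Return 1 if partners differ in race, 0 if not, None if the partner is absent."""
    # Index every person by (year, state, household, individual).
    race_of = {(y, s, h, i): race for y, s, h, i, _, race in people}
    flags = []
    for y, s, h, _, partner, race in people:
        partner_race = race_of.get((y, s, h, partner))
        flags.append(None if partner_race is None else int(partner_race != race))
    return flags

result = mixed_flags(people)
print(result)
```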

How to sum by group and add new variable dependent by the other two variables in SAS SQL

data work.want2;
input Y M $ ID $ volume;
datalines;
2009 JAN A1 100
2009 FEB A1 20
2009 FEB A1 80
2009 JAN A2 100
2009 JAN A2 100
2009 FEB A2 20
2009 FEB A2 80
2009 JAN A3 100
2009 FEB A3 150
2009 MAR A3 100
2011 DEC A1 100
2011 DEC A1 20
2011 DEC A2 20
2011 DEC A3 120
2011 DEC A3 80
2011 OCT A1 100
2011 OCT A2 20
2011 OCT A2 100
;
proc print data=want2;
run;
/*Code 2--> to sum by Y M ID*/
PROC SQL;
create table want3 as SELECT
Y,
M,
ID,
sum(volume) AS sumvolume
FROM want2
GROUP BY Y, M ,ID;
QUIT;
/*Code 3 -->get sum by Y M*/
PROC SQL;
SELECT
Y,
M,
sum(sumvolume) AS sumvolume_MO
FROM want3
GROUP BY Y, M;
QUIT;
I have used SAS SQL (code 2) to sum by Y, M and ID. I want to add a new variable, the monthly volume, that depends only on Y and M. I used code 3 to get those totals.
Is it possible to combine code 2 and code 3 to get the following results? I always get errors.
Thanks in advance.
Y M ID sumvolume sumvolume_MO
2009 FEB A1 100 350
2009 FEB A2 100 350
2009 FEB A3 150 350
2009 JAN A1 100 400
2009 JAN A2 200 400
2009 JAN A3 100 400
2009 MAR A3 100 100
2011 DEC A1 120 340
2011 DEC A2 20 340
2011 DEC A3 200 340
2011 OCT A1 100 220
2011 OCT A2 120 220
Updated to reflect results wanted sum(volume) instead of raw volume.
In general you would want to use sub queries. You could calculate the sum over the different groupings in separate subqueries and merge the results back together.
select a.y,a.m,a.id,a.sumvolume,b.sumvolume_mo
from
(select y,m,id,sum(volume) as sumvolume
from have
group by 1,2,3
) a
natural join
(select y,m,sum(volume) as sumvolume_mo
from have
group by 1,2
) b
;
But PROC SQL in SAS will also let you include non-grouping, non-aggregate variables in the SELECT and will automatically remerge the summary values back onto the detail rows for you. So you could get SUMVOLUME_MO by adding up the values of SUMVOLUME.
select y,m,id,sumvolume,sum(sumvolume) as sumvolume_mo
from
(select y,m,id,sum(volume) as sumvolume
from have
group by 1,2,3
)
group by 1,2
;
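The effect of both queries is the same: one total per (Y, M, ID), with the total per (Y, M) attached to each of those rows. The Python sketch below cross-checks the expected figures from the question; it is an illustration rather than SAS code, and `sum_with_monthly_total` is a hypothetical name.

```python
from collections import defaultdict

# (Y, M, ID, volume)
rows = [
    (2009, "JAN", "A1", 100), (2009, "FEB", "A1", 20), (2009, "FEB", "A1", 80),
    (2009, "JAN", "A2", 100), (2009, "JAN", "A2", 100), (2009, "FEB", "A2", 20),
    (2009, "FEB", "A2", 80), (2009, "JAN", "A3", 100), (2009, "FEB", "A3", 150),
    (2009, "MAR", "A3", 100), (2011, "DEC", "A1", 100), (2011, "DEC", "A1", 20),
    (2011, "DEC", "A2", 20), (2011, "DEC", "A3", 120), (2011, "DEC", "A3", 80),
    (2011, "OCT", "A1", 100), (2011, "OCT", "A2", 20), (2011, "OCT", "A2", 100),
]

def sum_with_monthly_total(rows):
    """Sum volume by (Y, M, ID) and attach the (Y, M) total to each summary row."""
    by_id = defaultdict(int)     # fine-grained totals
    by_month = defaultdict(int)  # coarse totals that get merged back
    for y, m, id_, volume in rows:
        by_id[(y, m, id_)] += volume
        by_month[(y, m)] += volume
    return [
        (y, m, id_, total, by_month[(y, m)])
        for (y, m, id_), total in sorted(by_id.items())
    ]

result = sum_with_monthly_total(rows)
for row in result:
    print(row)
```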
Thanks to Tom's answers, I can get the results with the following code.
PROC SQL;
create table newwant2 as
select y,m,id, sum(volume) as sumvolume_mo2,sumvolume_mo
from newwant
group by Y,M,id
;
Then I use the following code to delete the duplicate rows and keep the last row of each duplicate.
data newwant3;
set newwant2;
by Y M ID sumvolume_mo2 ;
if last.ID;
run;
proc print data=newwant3;
run;

SAS Rolling conditional statement (similar to excel)

I have a dataset and I would like to create a rolling conditional statement row by row (not sure what the exact term is called in SAS). I know how to do it in Excel but not sure on how it can be executed on SAS. The following is the dataset and what I would like to achieve.
Data set
----A---- | --Date-- | Amount |
11111 Jan 2015 1
11111 Feb 2015 1
11111 Mar 2015 2
11111 Apr 2015 2
11111 May 2015 2
11111 Jun 2015 1
11112 Jan 2015 2
11112 Feb 2015 1
11112 Mar 2015 1
11112 Apr 2015 4
11112 May 2015 3
11112 Jun 2015 1
I would like to add 2 columns named 'X' and 'Frequency' that show, for each value of 'A' and each 'Date', whether the Amount has gone up or down since the previous row and by how much. See the sample output below.
----A---- | --Date-- | Amount | --X-- | Frequency |
11111 Jan 2015 1 0 0
11111 Feb 2015 1 0 0
11111 Mar 2015 2 Add 1
11111 Apr 2015 2 0 0
11111 May 2015 2 0 0
11111 Jun 2015 1 Drop 1
11112 Jan 2015 2 0 0
11112 Feb 2015 1 Drop 1
11112 Mar 2015 1 0 0
11112 Apr 2015 4 Add 3
11112 May 2015 3 Drop 1
11112 Jun 2015 1 Drop 2
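Example using lag1(). Note that this example has no grouping variable, so the first row of the second block (Obs 7 in the output) is compared with the last row of the first block: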
Data A;
input date monyy7. Y; /* read Y as numeric so the comparisons and the subtraction below are numeric */
datalines;
Jan2015 1
Feb2015 1
Mar2015 2
Apr2015 2
May2015 2
Jun2015 1
Jan2015 2
Feb2015 1
Mar2015 1
Apr2015 4
May2015 3
Jun2015 1
;
data B;
set A;
lag_y=lag1(Y);
if lag_y = . then X ='missing';
if Y = lag_y then X='zero';
if Y > lag_y and lag_y ^= . then x='add';
if Y < lag_y then x= 'drop';
freq= abs(Y-lag_y);
run;
Output:
Obs date Y lag_y X freq
1 20089 1 missing
2 20120 1 1 zero 0
3 20148 2 1 add 1
4 20179 2 2 zero 0
5 20209 2 2 zero 0
6 20240 1 2 drop 1
7 20089 2 1 add 1
8 20120 1 2 drop 1
9 20148 1 1 zero 0
10 20179 4 1 add 3
11 20209 3 4 drop 1
12 20240 1 3 drop 2
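To get the output requested in the question, the comparison has to restart for each value of A. The Python sketch below shows that grouped version of the lag logic as a cross-check; it is not SAS, and `label_changes` is a hypothetical name. The first row of each group is labelled 0 because it has no previous row to compare with.

```python
# (A, date, amount), in date order within each A
rows = [
    (11111, "Jan 2015", 1), (11111, "Feb 2015", 1), (11111, "Mar 2015", 2),
    (11111, "Apr 2015", 2), (11111, "May 2015", 2), (11111, "Jun 2015", 1),
    (11112, "Jan 2015", 2), (11112, "Feb 2015", 1), (11112, "Mar 2015", 1),
    (11112, "Apr 2015", 4), (11112, "May 2015", 3), (11112, "Jun 2015", 1),
]

def label_changes(rows):
    """Label each row Add/Drop/0 relative to the previous row of the same group."""
    previous = {}  # A -> amount on the previous row of that group
    out = []
    for key, date, amount in rows:
        if key not in previous or amount == previous[key]:
            label, freq = "0", 0
        elif amount > previous[key]:
            label, freq = "Add", amount - previous[key]
        else:
            label, freq = "Drop", previous[key] - amount
        previous[key] = amount
        out.append((key, date, amount, label, freq))
    return out

result = label_changes(rows)
for row in result:
    print(row)
```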