I have a dataset and I would like to create a rolling conditional statement row by row (I am not sure of the exact term for this in SAS). I know how to do this in Excel but not how to execute it in SAS. Below is the dataset and what I would like to achieve.
Data set
----A---- | --Date-- | Amount |
11111 Jan 2015 1
11111 Feb 2015 1
11111 Mar 2015 2
11111 Apr 2015 2
11111 May 2015 2
11111 Jun 2015 1
11112 Jan 2015 2
11112 Feb 2015 1
11112 Mar 2015 1
11112 Apr 2015 4
11112 May 2015 3
11112 Jun 2015 1
I would like to add 2 columns named 'X' and 'Frequency' which would indicate, for each value of 'A' and 'Date', whether the Amount has gone up or down and by how much. See sample output below.
----A---- | --Date-- | Amount | --X-- | Frequency |
11111 Jan 2015 1 0 0
11111 Feb 2015 1 0 0
11111 Mar 2015 2 Add 1
11111 Apr 2015 2 0 0
11111 May 2015 2 0 0
11111 Jun 2015 1 Drop 1
11112 Jan 2015 2 0 0
11112 Feb 2015 1 Drop 1
11112 Mar 2015 1 0 0
11112 Apr 2015 4 Add 3
11112 May 2015 3 Drop 1
11112 Jun 2015 1 Drop 2
Example using Lag1():
Data A;
   input date :monyy7. Y;
   datalines;
Jan2015 1
Feb2015 1
Mar2015 2
Apr2015 2
May2015 2
Jun2015 1
Jan2015 2
Feb2015 1
Mar2015 1
Apr2015 4
May2015 3
Jun2015 1
;
data B;
   set A;
   length X $7;
   lag_y = lag1(Y);
   /* to reset at each value of a grouping variable such as A,
      sort by it, add "by A;" and set lag_y to missing when
      first.A, before the IF chain below */
   if lag_y = . then X = 'missing';
   else if Y = lag_y then X = 'zero';
   else if Y > lag_y then X = 'add';
   else if Y < lag_y then X = 'drop';
   freq = abs(Y - lag_y);
run;
Output:
Obs date Y lag_y X freq
1 20089 1 missing
2 20120 1 1 zero 0
3 20148 2 1 add 1
4 20179 2 2 zero 0
5 20209 2 2 zero 0
6 20240 1 2 drop 1
7 20089 2 1 add 1
8 20120 1 2 drop 1
9 20148 1 1 zero 0
10 20179 4 1 add 3
11 20209 3 4 drop 1
12 20240 1 3 drop 2
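For readers who want to sanity-check the classification outside SAS, here is a minimal Python sketch of the same first-difference logic (plain stdlib, purely illustrative; the function name is made up). Like the LAG1 example, it does not reset at group boundaries, which is why obs 7 comes out as "add" even though it starts a new ID:

```python
def label_changes(amounts):
    """Mimic the LAG1-based DATA step: compare each value with the one
    on the previous row (no BY-group reset, same as the SAS example)."""
    out = []
    prev = None
    for y in amounts:
        if prev is None:
            out.append(("missing", None))
        elif y == prev:
            out.append(("zero", 0))
        elif y > prev:
            out.append(("add", y - prev))
        else:
            out.append(("drop", prev - y))
        prev = y
    return out

# the Amount column from the question, both IDs stacked
amounts = [1, 1, 2, 2, 2, 1, 2, 1, 1, 4, 3, 1]
result = label_changes(amounts)
```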
I have the following dataset from a crossover design study with participant_id, treatment_arm, and date_of_treatment as follows:
participant_id   treatment_arm   date_of_treatment
1                A               Jan 1 2022
1                B               Jan 2 2022
1                C               Jan 3 2022
2                C               Jan 4 2022
2                B               Jan 5 2022
2                A               Jan 6 2022
So for participant_id 1, based on the order of the date_of_treatment, the sequence would be ABC. For participant_id 2, it would be CBA.
Based on the above, I want to create column seq as follows:
participant_id   treatment_arm   date_of_treatment   seq
1                A               Jan 1 2022          ABC
1                B               Jan 2 2022          ABC
1                C               Jan 3 2022          ABC
2                C               Jan 4 2022          CBA
2                B               Jan 5 2022          CBA
2                A               Jan 6 2022          CBA
How do I go about creating this column using the three variables participant_id, treatment_arm, and date_of_treatment in a data step?
You could use a double DoW loop:
data want;
do until (last.participant_id);
set have;
length seq $ 3;
by participant_id;
seq = cats(seq, treatment_arm);
end;
do until (last.participant_id);
set have;
by participant_id;
output;
end;
run;
Remember to change the length of seq should there be more than 3 treatments for each participant.
participant_id treatment_arm date_of_treatment seq
1 A 01JAN2022 ABC
1 B 02JAN2022 ABC
1 C 03JAN2022 ABC
2 C 04JAN2022 CBA
2 B 05JAN2022 CBA
2 A 06JAN2022 CBA
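The two-pass idea behind the double DoW loop (first walk a BY group to build seq, then walk it again to output each row with the finished value) can be sketched in Python for illustration. The tuple layout and function name are assumptions, and the toy data mirrors the question:

```python
from itertools import groupby

# (participant_id, treatment_arm, date_of_treatment) rows, grouped by id
rows = [
    (1, "A", "2022-01-01"), (1, "B", "2022-01-02"), (1, "C", "2022-01-03"),
    (2, "C", "2022-01-04"), (2, "B", "2022-01-05"), (2, "A", "2022-01-06"),
]

def add_seq(rows):
    """Two passes per group, like the double DoW loop: pass 1 builds seq
    by concatenating treatment_arm in date order; pass 2 emits each row
    with the finished seq attached."""
    out = []
    for pid, grp in groupby(rows, key=lambda r: r[0]):
        grp = sorted(grp, key=lambda r: r[2])          # order by date
        seq = "".join(arm for _, arm, _ in grp)        # pass 1
        out.extend((pid, arm, d, seq) for _, arm, d in grp)  # pass 2
    return out
```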
I have panel data of individuals, their marital status (0 = not married, 1 = married) and one random shock (0 = no shock, 1 = shock). Now, for the people who experience the shock (everyone except id1), I would like to know which persons were already married when they experienced the shock (n = 2: id3, id5), who was not married when they experienced the shock but subsequently got married (n = 1: id2), and who was not married when they experienced the shock and did not get married subsequently (n = 1: id4).
* Example generated by -dataex-. For more info, type help dataex
clear
input int year str3 id float(shock maritalstatus)
2010 "id1" 0 1
2011 "id1" 0 1
2012 "id1" 0 1
2013 "id1" 0 0
2014 "id1" 0 0
2015 "id1" 0 0
2010 "id2" 1 0
2011 "id2" 0 1
2012 "id2" 0 1
2013 "id2" 0 1
2014 "id2" 0 1
2015 "id2" 0 1
2010 "id3" 0 1
2011 "id3" 0 1
2012 "id3" 0 1
2013 "id3" 1 1
2014 "id3" 0 1
2015 "id3" 0 1
2010 "id4" 1 0
2011 "id4" 0 0
2012 "id4" 0 0
2013 "id4" 0 0
2014 "id4" 0 0
2015 "id4" 0 0
2010 "id5" 0 1
2011 "id5" 0 1
2012 "id5" 1 1
2013 "id5" 0 1
2014 "id5" 0 1
2015 "id5" 0 1
end
Thanks for the data example.
Being married when the shock arrived is identifiable by looking at each observation, but the trick lies in spreading that to all observations for the same identifier.
egen married_at_shock = total(marital == 1 & shock == 1), by(id)
The next variable is a variation on the same theme.
egen not_married_at_shock = total(marital == 0 & shock == 1), by(id)
The last variable seems harder to me. I think you have to work out explicitly when the shock occurred
egen when_shock = mean(cond(shock == 1, year, .)), by(id)
and then check what happened afterwards
egen never_married_after_shock = total(marital & year > when_shock), by(id)
replace never_married_after_shock = never_married == 0 if when_shock < .
tabdisp id, c(*married*)
----------------------------------------------------------------------------
id | married_at_shock not_married_at_shock never_married_afte~k
----------+-----------------------------------------------------------------
id1 | 0 0 0
id2 | 0 1 0
id3 | 1 0 0
id4 | 0 1 1
id5 | 1 0 0
----------------------------------------------------------------------------
There are no doubt other ways to approach this.
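As a cross-check of the classification logic, here is an illustrative Python sketch on a toy version of the dataex panel (the function and dict names are made up; it mirrors what the egen indicators establish rather than reproducing them line by line):

```python
# toy panel: id -> list of (year, shock, married), mirroring the dataex
panel = {
    "id1": [(y, 0, m) for y, m in zip(range(2010, 2016), [1, 1, 1, 0, 0, 0])],
    "id2": [(2010, 1, 0)] + [(y, 0, 1) for y in range(2011, 2016)],
    "id3": [(y, int(y == 2013), 1) for y in range(2010, 2016)],
    "id4": [(2010, 1, 0)] + [(y, 0, 0) for y in range(2011, 2016)],
    "id5": [(y, int(y == 2012), 1) for y in range(2010, 2016)],
}

def classify(history):
    """Classify one id's history relative to the shock year."""
    shock_years = [y for y, s, m in history if s == 1]
    if not shock_years:
        return None                      # never shocked, like id1
    when = shock_years[0]
    if any(s == 1 and m == 1 for y, s, m in history):
        return "married_at_shock"
    married_after = any(m == 1 and y > when for y, s, m in history)
    return "married_after_shock" if married_after else "never_married_after_shock"

labels = {pid: classify(h) for pid, h in panel.items()}
```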
Any reading list starts with underlining that true and false conditions yield 1 and 0 respectively, as discussed in this FAQ, which has many applications, such as "any" and "all" questions, which include "ever" and "never".
The use of egen as a workhorse here is natural given your need to work both on observations for each identifier and over each history. Some tricks are covered in
this paper.
I have the following data with person ID and whether they have insurance in each year:
ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
What I want is to add another column, which equals 1 if a person is ever insured. For example, Person 2 only had insurance in 2002 but it means he has had insurance at some point, so Ever_Ins should equal 1 in all years:
ID Year Insured Ever_Ins
1 2001 1 1
2 2001 0 1
3 2001 0 0
1 2002 1 1
2 2002 1 1
3 2002 0 0
1 2003 1 1
2 2003 0 1
3 2003 0 0
I cannot use egen Ever_Ins = max(Insured), by (ID) because Insured is not a dummy in the true data. It has values such as 9 for unknown.
Technique for "any" and "all" problems is documented in this FAQ. See also this paper for a more detailed discussion. Here is one way to do it.
clear
input ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
end
egen Ever_Ins = max(Insured == 1), by(ID)
sort ID Year
list , sepby(ID)
+--------------------------------+
| ID Year Insured Ever_Ins |
|--------------------------------|
1. | 1 2001 1 1 |
2. | 1 2002 1 1 |
3. | 1 2003 1 1 |
|--------------------------------|
4. | 2 2001 0 1 |
5. | 2 2002 1 1 |
6. | 2 2003 0 1 |
|--------------------------------|
7. | 3 2001 0 0 |
8. | 3 2002 0 0 |
9. | 3 2003 0 0 |
+--------------------------------+
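The `max(Insured == 1)` idea carries over to any language: test for the exact value 1 so that codes such as 9 (unknown) do not count as insured. A minimal Python sketch, purely illustrative, using the question's data plus one extra hypothetical `9` record to show the point:

```python
# (ID, Year, Insured) records; the last row is an added hypothetical
# "unknown" code to show that 9 must not count as insured
records = [
    (1, 2001, 1), (2, 2001, 0), (3, 2001, 0),
    (1, 2002, 1), (2, 2002, 1), (3, 2002, 0),
    (1, 2003, 1), (2, 2003, 0), (3, 2003, 0),
    (3, 2004, 9),
]

# equivalent of egen Ever_Ins = max(Insured == 1), by(ID)
ever = {}
for pid, year, insured in records:
    ever[pid] = max(ever.get(pid, 0), int(insured == 1))

# spread the flag back onto every row, like egen's by() does
with_flag = [(pid, year, ins, ever[pid]) for pid, year, ins in records]
```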
I am trying to create instruments from a three-dimensional panel dataset, as included below:
input firm year market price comp_avg
1 2000 10 1 .
3 2000 10 2 .
3 2001 10 3 .
1 2002 10 4 .
3 2002 10 5 .
1 2000 20 6 .
3 2000 20 7 .
1 2001 20 8 .
2 2001 20 9 .
3 2001 20 10 .
1 2002 20 20 .
2 2002 20 30 .
3 2002 20 40 .
2 2000 30 50 .
1 2001 30 60 .
2 2001 30 70 .
1 2002 30 80 .
2 2002 30 90 .
end
The instrument I am trying to create is the lagged (year-1) average price of a firm's competitors (those in the same market), for each market the firm operates in, in a given year.
At the moment, I have some code that does the job, but I am hoping that I am missing something and can do this in a more clear or efficient way.
Here is the code:
// for each firm
qui levelsof firm, local(firms)
qui foreach f in `firms' {
// find all years for that firm
levelsof year if firm == `f', local(years)
foreach y in `years' {
// skip first year (because there is no lagged data)
if `y' == 2000 {
continue
}
// find all markets in that year
levelsof market if firm == `f' & year == `y', local(mkts)
local L1 = `y'-1
foreach m in `mkts' {
// get the average of all competitors in that market in the prior year
gen temp = firm != `f' & year == `L1' & market == `m'
su price if temp
replace comp_avg = r(mean) if firm == `f' & market == `m' & year == `y'
drop temp
}
}
}
The data I am working with are reasonably large (~1 million obs) so the faster the better.
clear
input firm year market price
1 2000 10 1
3 2000 10 2
3 2001 10 3
1 2002 10 4
3 2002 10 5
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
2 2000 30 50
1 2001 30 60
2 2001 30 70
1 2002 30 80
2 2002 30 90
end
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
bysort market year : egen total = total(Lprice)
bysort market year : egen count = count(Lprice)
gen mean_others = (total - cond(missing(Lprice), 0, Lprice)) ///
/ (count - cond(missing(Lprice), 0, 1))
sort market year
list market year firm price Lprice mean_others total count, sepby(market year)
     +------------------------------------------------------------------+
     | market   year   firm   price   Lprice   mean_o~s   total   count |
     |------------------------------------------------------------------|
  1. |     10   2000      1       1        .          .       0       0 |
  2. |     10   2000      3       2        .          .       0       0 |
     |------------------------------------------------------------------|
  3. |     10   2001      3       3        2          .       2       1 |
     |------------------------------------------------------------------|
  4. |     10   2002      1       4        .          3       3       1 |
  5. |     10   2002      3       5        3          .       3       1 |
     |------------------------------------------------------------------|
  6. |     20   2000      3       7        .          .       0       0 |
  7. |     20   2000      1       6        .          .       0       0 |
     |------------------------------------------------------------------|
  8. |     20   2001      2       9        .        6.5      13       2 |
  9. |     20   2001      3      10        7          6      13       2 |
 10. |     20   2001      1       8        6          7      13       2 |
     |------------------------------------------------------------------|
 11. |     20   2002      1      20        8        9.5      27       3 |
 12. |     20   2002      3      40       10        8.5      27       3 |
 13. |     20   2002      2      30        9          9      27       3 |
     |------------------------------------------------------------------|
 14. |     30   2000      2      50        .          .       0       0 |
     |------------------------------------------------------------------|
 15. |     30   2001      2      70       50          .      50       1 |
 16. |     30   2001      1      60        .         50      50       1 |
     |------------------------------------------------------------------|
 17. |     30   2002      2      90       70         60     130       2 |
 18. |     30   2002      1      80       60         70     130       2 |
     +------------------------------------------------------------------+
My approach breaks it down:
1. Calculate the previous price for the same firm and market. (#1 could also be done by declaring each (firm, market) pair a panel.)
2. The mean of the other values (here previous prices) in the same market and year is (total MINUS this price) divided by (count MINUS 1).
3. #2 needs a modification: if this price is missing, you need to subtract 0 from both numerator and denominator. Stata's normal rules would render sum MINUS missing as missing, but this firm's previous price might be unknown while others in the same market have known prices.
Note: There are small ways of speeding up your code, but this should be faster (so long as it is correct).
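The two steps (lag within firm-market, then leave-one-out mean within market-year) can be cross-checked with a small illustrative Python sketch on the market-10 slice of the example. The names are made up, and it assumes at most one observation per firm-market-year:

```python
from collections import defaultdict

# toy slice of the example: (firm, year, market, price) for market 10
rows = [
    (1, 2000, 10, 1), (3, 2000, 10, 2), (3, 2001, 10, 3),
    (1, 2002, 10, 4), (3, 2002, 10, 5),
]

# step 1: lagged price per (firm, market); looking up year-1 directly
# enforces the "consecutive years only" condition from the Stata code
price = {(f, m, y): p for f, y, m, p in rows}
lagged = {(f, m, y): price.get((f, m, y - 1)) for f, y, m, p in rows}

# step 2: totals and counts of non-missing lagged prices per (market, year)
total = defaultdict(float)
count = defaultdict(int)
for (f, m, y), lp in lagged.items():
    if lp is not None:
        total[m, y] += lp
        count[m, y] += 1

def mean_others(f, m, y):
    """Leave-one-out mean: subtract own lagged price, or 0 if missing."""
    lp = lagged[f, m, y]
    t = total[m, y] - (lp if lp is not None else 0)
    n = count[m, y] - (0 if lp is None else 1)
    return t / n if n else None
```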
EDIT: Another solution (2 lines) using rangestat (must be installed using ssc inst rangestat):
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
rangestat (mean) Lprice, interval(year 0 0) by(market) excludeself
I have a file that look at ratings that teacher X gives to teacher Y and the date it occurs
clear
input rating_id RatingTeacher RatedTeacher Rating str8 Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look in the history to see how many times the RatingTeacher had been rated at the time they make the rating and the cumulative score. The result would look like this.
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there is a more efficient way to do this within the file.
In your input data, I guess that the last rating_id should be 6 and that dates are MDY. Statalist members are asked to use dataex (SSC) to set up data examples. This isn't Statalist but there is no reason for lower standards to apply. See the Statalist FAQ
I rarely see even programmers be precise about what they mean by "efficient": whether it means fewer lines of code, less use of memory, more speed, something else, or is just an all-purpose term of praise. This code loops over observations, which can certainly be slow for large datasets. More in this paper.
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
count if Date < Date[`i'] & RatedT == RatingT[`i']
replace Wanted = r(N) in `i'
}
list, sep(0)
+---------------------------------------------------------------------+
| rating~d Rating~r RatedT~r Rating SDate Date Wanted |
|---------------------------------------------------------------------|
1. | 1 15 12 1 1/1/2010 18263 0 |
2. | 2 12 11 2 1/2/2010 18264 1 |
3. | 3 14 11 3 1/2/2010 18264 0 |
4. | 4 14 13 2 1/5/2010 18267 0 |
5. | 5 19 11 4 1/6/2010 18268 0 |
6. | 6 11 13 1 1/7/2010 18269 3 |
+---------------------------------------------------------------------+
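The counting logic, together with the cumulative score the question also asks for, can be sketched in Python for illustration. Like the Stata loop it compares strictly earlier dates; the function name and tuple layout are assumptions:

```python
from datetime import date

# (rating_id, RatingTeacher, RatedTeacher, Rating, Date), corrected data
ratings = [
    (1, 15, 12, 1, date(2010, 1, 1)),
    (2, 12, 11, 2, date(2010, 1, 2)),
    (3, 14, 11, 3, date(2010, 1, 2)),
    (4, 14, 13, 2, date(2010, 1, 5)),
    (5, 19, 11, 4, date(2010, 1, 6)),
    (6, 11, 13, 1, date(2010, 1, 7)),
]

def history(ratings):
    """For each rating, count the ratings the rater had received on
    strictly earlier dates, and sum those scores."""
    out = []
    for rid, rater, ratee, score, d in ratings:
        prior = [s for _, _, e, s, d2 in ratings if e == rater and d2 < d]
        out.append((rid, len(prior), sum(prior)))
    return out
```

This is O(n^2) like the observation loop, so for ~1 million rows a sort-and-scan by teacher and date would be the faster design.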
The building block is that the rater and ratee are a pair. You can use egen's group() to give a unique ID to each rater ratee pair.
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n