How to copy the prior month's last observation's value to other observations? - stata

I am working in Stata with daily price data, where permno is the company identifier.
For each permno and month, I want the price at the end of the previous month.
tsset permno date
gen m= mofd(date)
format m %tm
* price end of each month
bys permno m: gen prc_end= prc[_N]
* price end of the prior month
gen n=m-1
format n %tm
bys permno m: gen prc_endpm= prc_end if n==m[_n-1]
* no results
Current Data:
permno date prc m prc_end n
10026 24-Jan-19 145.8000031 2019m1 154.35 2018m12
10026 25-Jan-19 144.5500031 2019m1 154.35 2018m12
10026 28-Jan-19 140 2019m1 154.35 2018m12
10026 29-Jan-19 156.8200073 2019m1 154.35 2018m12
10026 30-Jan-19 150.5 2019m1 154.35 2018m12
10026 31-Jan-19 154.3500061 2019m1 154.35 2018m12
10026 01-Feb-19 154.8000031 2019m2 155.28 2019m1
10026 04-Feb-19 158.4400024 2019m2 155.28 2019m1
10026 05-Feb-19 158.2599945 2019m2 155.28 2019m1
10026 06-Feb-19 158.2400055 2019m2 155.28 2019m1
10026 07-Feb-19 156.4100037 2019m2 155.28 2019m1

To make things clearer, consider a silly example dataset.
clear
set obs 6
gen permno = 1
gen date = mdy(1 + (_n > 3), real(word("1 15 31 1 15 28", _n)), 2022)
format date %td
gen m = mofd(date)
format m %tm
gen n = m - 1
format n %tm
gen price = _n
list, sepby(permno m)
+-----------------------------------------------+
| permno date m n price |
|-----------------------------------------------|
1. | 1 01jan2022 2022m1 2021m12 1 |
2. | 1 15jan2022 2022m1 2021m12 2 |
3. | 1 31jan2022 2022m1 2021m12 3 |
|-----------------------------------------------|
4. | 1 01feb2022 2022m2 2022m1 4 |
5. | 1 15feb2022 2022m2 2022m1 5 |
6. | 1 28feb2022 2022m2 2022m1 6 |
+-----------------------------------------------+
Now
bys permno m: gen prc_end= price[_N]
will probably work, but this would be safer:
bys permno m (date): gen prc_end= price[_N]
Things go wrong when you go
bys permno m: gen prc_endpm= prc_end if n==m[_n-1]
The effect of the by: is to confine calculations to blocks with the same permno and monthly date. [_n-1] here is legal, but it refers to the previous observation in the same block of observations (useful if there is one; if there isn't, the code is still legal and simply returns missing).
You want [_n-1] to refer to the previous month (and the same permno), but that is not what your syntax means. The example also shows that, although your syntax is legal, no observations satisfy your if condition, because m and n are never equal within the same block of observations.
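To see the difference concretely, here is a minimal sketch on the same kind of toy data. The variable names prev_in_block and prev_in_permno are made up for illustration only.

```stata
clear
input permno str9 sdate price
1 "01jan2022" 1
1 "15jan2022" 2
1 "31jan2022" 3
1 "01feb2022" 4
1 "15feb2022" 5
1 "28feb2022" 6
end
gen date = daily(sdate, "DMY")
format date %td
gen m = mofd(date)
format m %tm

* blocks are permno-month: [_n-1] cannot reach into another month
bysort permno m (date): gen prev_in_block = price[_n-1]

* blocks are permno only: [_n-1] can cross the month boundary
bysort permno (date): gen prev_in_permno = price[_n-1]

list, sepby(permno m)
```

On 01feb2022 the first variable is missing, because that observation is the first in its block, while the second variable picks up the 31jan2022 price.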
What you want can be done with by: but you need to look across months.
This should do it:
bysort permno (m date) : gen previous = price[_n-1] if m[_n-1] == m -1
bysort permno m (date) : replace previous = previous[1]
list, sepby(permno m)
+----------------------------------------------------------+
| permno date m n price previous |
|----------------------------------------------------------|
1. | 1 01jan2022 2022m1 2021m12 1 . |
2. | 1 15jan2022 2022m1 2021m12 2 . |
3. | 1 31jan2022 2022m1 2021m12 3 . |
|----------------------------------------------------------|
4. | 1 01feb2022 2022m2 2022m1 4 3 |
5. | 1 15feb2022 2022m2 2022m1 5 3 |
6. | 1 28feb2022 2022m2 2022m1 6 3 |
+----------------------------------------------------------+
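An alternative that some may find easier to read is to build a small lookup of month-end prices, shift its month by one, and merge it back. This is only a sketch under the same assumptions as the toy data above (one price per permno and date); the tempfile name ends is arbitrary.

```stata
clear
input permno str9 sdate price
1 "01jan2022" 1
1 "15jan2022" 2
1 "31jan2022" 3
1 "01feb2022" 4
1 "15feb2022" 5
1 "28feb2022" 6
end
gen date = daily(sdate, "DMY")
format date %td
gen m = mofd(date)
format m %tm

* lookup of month-end prices, keyed by the FOLLOWING month
preserve
bysort permno m (date): keep if _n == _N
keep permno m price
rename price previous
replace m = m + 1
tempfile ends
save `ends'
restore

* attach the lookup to every daily observation
merge m:1 permno m using `ends', keep(master match) nogenerate
sort permno date
list, sepby(permno m)
```

As with the by: solution, a month with no observations in the preceding month gets a missing value, because the lookup then has no matching row.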

Related

How to make moving average across groups?

| month | year | amount|
|-------|--------|-------|
| 1 | 2010 | 26 |
| 1 | 2010 | 26 |
| 2 | 2010 | 30 |
| 3 | 2010 | 35 |
| 3 | 2010 | 35 |
I need to create another variable that adds the prior month's amount (_n-1) to the current month's amount (_n) and divides the sum by 2, somewhat like a moving average. The problem is that I need to do it by month and year, since the same month and year can appear more than once. There are other variables as well that are irrelevant here, but they are the reason I can't just delete duplicates.
For example, for observation 5, I would need it to be (35+30+26) / 3
Your prescription and your example don't match at all. Your example is a mean of 3 monthly means, this month and the two previous. Your prescription is a month and the month previous.
Here is some technique that focuses on two possible meanings of your prescription.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte month int year byte amount
1 2010 26
1 2010 26
2 2010 30
3 2010 35
3 2010 35
end
gen mdate = ym(year, month)
format mdate %tm
foreach w in total mean count {
egen `w' = `w'(amount), by(mdate)
}
gen wanted1 = (mean + mean[_n-1]) / 2 if mdate == mdate[_n-1] + 1
bysort mdate (wanted1) : replace wanted1 = wanted1[_n-1] if missing(wanted1)
gen wanted2 = (total + total[_n-1]) / (count + count[_n-1]) if mdate == mdate[_n-1] + 1
bysort mdate (wanted2) : replace wanted2 = wanted2[_n-1] if missing(wanted2)
list, sepby(mdate)
+----------------------------------------------------------------------------+
| month year amount mdate total mean count wanted1 wanted2 |
|----------------------------------------------------------------------------|
1. | 1 2010 26 2010m1 52 26 2 . . |
2. | 1 2010 26 2010m1 52 26 2 . . |
|----------------------------------------------------------------------------|
3. | 2 2010 30 2010m2 30 30 1 28 27.33333 |
|----------------------------------------------------------------------------|
4. | 3 2010 35 2010m3 70 35 2 32.5 33.33333 |
5. | 3 2010 35 2010m3 70 35 2 32.5 33.33333 |
+----------------------------------------------------------------------------+
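For the first reading only (the mean of this month's mean and the previous month's mean), a possible variation is to collapse to one row per month, use the lag operator after tsset, and merge the result back. This is a sketch, not a replacement for the code above; the tempfile name monthly is arbitrary.

```stata
clear
input byte month int year byte amount
1 2010 26
1 2010 26
2 2010 30
3 2010 35
3 2010 35
end
gen mdate = ym(year, month)
format mdate %tm

* one row per month, so that L. means "previous month"
preserve
collapse (mean) mean = amount, by(mdate)
tsset mdate
gen wanted1 = (mean + L.mean) / 2
keep mdate wanted1
tempfile monthly
save `monthly'
restore

* copy the monthly result to every original observation
merge m:1 mdate using `monthly', nogenerate
sort mdate
list, sepby(mdate)
```

Because tsset knows the monthly spacing, L.mean is missing when the previous month is absent, which plays the same role as the mdate == mdate[_n-1] + 1 condition.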

Computing running sum with moving time-window

My data
I am working on a spell dataset in the following format:
cls
clear all
set more off
input id spellnr str7 bdate_str str7 edate_str employed
1 1 2008m1 2008m9 1
1 2 2008m12 2009m8 0
1 3 2009m11 2010m9 1
1 4 2010m10 2011m9 0
///
2 1 2007m4 2009m12 1
2 2 2010m4 2011m4 1
2 3 2011m6 2011m8 0
end
* translate to Stata monthly dates
gen bdate = monthly(bdate_str,"YM")
gen edate = monthly(edate_str,"YM")
drop *_str
format %tm bdate edate
list, sepby(id)
Corresponding to:
+---------------------------------------------+
| id spellnr employed bdate edate |
|---------------------------------------------|
1. | 1 1 1 2008m1 2008m9 |
2. | 1 2 0 2008m12 2009m8 |
3. | 1 3 1 2009m11 2010m9 |
4. | 1 4 0 2010m10 2011m9 |
|---------------------------------------------|
5. | 2 1 1 2007m4 2009m12 |
6. | 2 2 1 2010m4 2011m4 |
7. | 2 3 0 2011m6 2011m8 |
+---------------------------------------------+
Here a given person (id) can have multiple spells (spellnr) of two types (employed: 1 for employment; 0 for unemployment). The start and end dates of each spell are defined by bdate and edate, respectively.
Imagine the data was already cleaned, and is such that no spells overlap with each other.
There might be "missing" periods in between any two spells though.
This is captured by the dummy dataset above.
My question:
For each unemployment spell, I need to compute the number of months spent in employment in the last 6 months, 24 months, and 48 months.
Note that, importantly, each id can go in and out from employment, and all past employment spells should be taken into account (not just the last one).
In my example, this would lead to the following desired output:
+--------------------------------------------------------------+
| id spellnr employed bdate edate m6 m24 m48 |
|--------------------------------------------------------------|
1. | 1 1 1 2008m1 2008m9 . . . |
2. | 1 2 0 2008m12 2009m8 4 9 9 |
3. | 1 3 1 2009m11 2010m9 . . . |
4. | 1 4 0 2010m10 2011m9 6 11 20 |
|--------------------------------------------------------------|
5. | 2 1 1 2007m4 2009m12 . . . |
6. | 2 2 1 2010m4 2011m4 . . . |
7. | 2 3 0 2011m6 2011m8 5 20 44 |
+--------------------------------------------------------------+
My (working) attempt:
The following code returns the desired result.
* expand each spell to one observation per time unit (here "months"; works also for days)
expand edate-bdate+1
bysort id spellnr: gen spell_date = bdate + _n - 1
format %tm spell_date
list, sepby(id spellnr)
* fill-in empty months (not covered by spells)
xtset id spell_date, monthly
tsfill
* compute cumulative time spent in employment and lagged values
bysort id (spell_date): gen cum_empl = sum(employed) if employed==1
bysort id (spell_date): replace cum_empl = cum_empl[_n-1] if cum_empl==.
bysort id (spell_date): gen lag_7 = L7.cum_empl if employed==0
bysort id (spell_date): gen lag_24 = L25.cum_empl if employed==0
bysort id (spell_date): gen lag_48 = L49.cum_empl if employed==0
qui replace lag_7=0 if lag_7==. & employed==0 // fix computation for first spell of each "id" (if not enough time to go back with "L.")
qui replace lag_24=0 if lag_24==. & employed==0
qui replace lag_48=0 if lag_48==. & employed==0
* compute time spent in employment in the last 6, 24, 48 months, at the beginning of each unemployment spell
bysort id (spell_date): gen m6 = cum_empl - lag_7 if employed==0
bysort id (spell_date): gen m24 = cum_empl - lag_24 if employed==0
bysort id (spell_date): gen m48 = cum_empl - lag_48 if employed==0
qui drop if (spellnr==.)
qui bysort id spellnr (spell_date): keep if _n == 1
drop spell_date cum_empl lag_*
list
This works fine, but becomes quite inefficient when using (several millions of) daily data. Can you suggest any alternative approach that does not involve expanding the dataset?
In words what I do above is:
I expand data to have one row per month;
I fill-in the "gaps" in between the spells with -tsfill-
I Compute the running time spent in employment, and use lag operators to get the three quantities of interest.
This is in the vein of what was done here, in a past question that I posted. However, the working example there was unnecessarily complicated and contained some mistakes.
SOLUTIONS PERFORMANCE
I tried different approaches suggested in the accepted answer below (including using joinby as suggested in an earlier version of the answer). In order to create a larger dataset I used:
expand 500000
bysort id spellnr: gen new_id = _n
drop id
rename new_id id
which creates a dataset with 500,000 id's (for a total of 3,500,000 spells).
The first solution largely dominates the ones that use joinby or rangejoin (see also the comments to the accepted answer below).
The code below might save some running time.
bys id (employed): gen tag = _n if !employed
sum tag, meanonly
local maxtag = `r(max)'
foreach i in 6 24 48 {
gen m`i' = .
forval d = 1/`maxtag' {
by id: gen x = 1 + min(bdate[`d'],edate) - max(bdate[`d']-`i',bdate) if employed
egen y = total(x*(x>0)), by(id)
replace m`i' = y if tag == `d'
drop x y
}
}
sort id bdate
The same logic combined with -rangejoin- (from SSC) also deserves a try. Please provide some feedback after testing with your (large) actual data.
preserve
keep if employed
replace employed = 0
tempfile em
save `em'
restore
foreach i in 6 24 48 {
gen _bd = bdate - `i'
rangejoin edate _bd bdate using `em', by(id employed) p(_)
egen m`i' = total(_edate - max(_bd,_bdate)+1) if !employed, by(id bdate)
bys id bdate: keep if _n==1
drop _*
}

Count concurrent subscriptions

I have a database with a number of people who (may) have multiple subscriptions to a service running at once and transactional data for each event during the life of their subscription. I am trying to create a variable that counts the number of current active subscriptions the user has at a given transaction time.
To illustrate an example, my data lives in the form:
person | subscription | obs_date | sub_start_date | sub_end_date | num_concurrent_subs
--------------------------------------------------------------------------------------
1 | 1 | 09/01/10 | 09/01/10 | 09/01/11 | 1
1 | 1 | 10/01/10 | 09/01/10 | 09/01/11 | 2
1 | 1 | 11/01/10 | 09/01/10 | 09/01/11 | 2
1 | 2 | 10/01/10 | 10/01/10 | 09/01/11 | 2
1 | 2 | 11/01/10 | 10/01/10 | 09/01/11 | 2
1 | 3 | 11/01/14 | 09/01/14 | . | 1
1 | 3 | 11/01/16 | 09/01/14 | . | 1
1 | 4 | 11/01/15 | 10/01/15 | 11/01/15 | 3
1 | 5 | 11/01/15 | 10/01/15 | 11/01/15 | 3
And so on and so forth for each person. I want to generate the num_concurrent_subs as above.
That is, for each person, look at each observation and count how many of that person's subscriptions have a range sub_start_date to sub_end_date that contains the observation date.
I've read a bit on Stata's count function and believe I'm close to a solution, but I'm not sure how to check it across different subscriptions.
You can do this by separating the subscription information from the transaction data and convert the subscription data to long form, with one observation for the start date and another for the end date. Then you recombine with the transaction data and order by a single date variable. You use an onoff variable to track the start and end of each subscription. Something like:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date) byte num_concurrent_subs
1 1 "09/01/10" "09/01/10" "09/01/11" 1
1 1 "10/01/10" "09/01/10" "09/01/11" 2
1 1 "11/01/10" "09/01/10" "09/01/11" 2
1 2 "10/01/10" "10/01/10" "09/01/11" 2
1 2 "11/01/10" "10/01/10" "09/01/11" 2
1 3 "11/01/14" "09/01/14" "." 1
1 3 "11/01/16" "09/01/14" "." 1
1 4 "11/01/15" "10/01/15" "11/01/15" 3
1 5 "11/01/15" "10/01/15" "11/01/15" 3
end
* should always have an observation identifier
gen obsid = _n
* convert string to Stata numeric dates
gen odate = daily(obs_date,"MD20Y")
gen substart = daily(sub_start_date,"MD20Y")
gen subend = daily(sub_end_date,"MD20Y")
format %td odate substart subend
save "main_data.dta", replace
* reduce to subscription info with one obs for the start and one obs
* for the end of each subscription. use an onoff variable to track
* start and end events
keep person subscription substart subend
bysort person subscription substart subend: keep if _n == 1
expand 2
bysort person subscription: gen adate = cond(_n == 1, substart, subend)
by person subscription: gen onoff = cond(_n == 1, 1, -1)
replace onoff = 0 if mi(adate)
format %td adate
append using "main_data.dta"
* include obs date in adate and nothing happens on the observation date
replace adate = odate if !mi(obsid)
replace onoff = 0 if !mi(obsid)
* order by person adate, put on event first, then obs events, then off events
gsort person adate -onoff
by person: gen concur = sum(onoff)
* return to original obs
keep if !mi(obsid)
sort obsid
Here's another way to do this using rangejoin (from SSC). To install it, type in Stata's Command window:
ssc install rangejoin
With rangejoin, you can pair each subscription with all transactional data that falls within the subscription start and end dates. Then it is just a matter of counting, per transaction observation, how many subscriptions it is paired with.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date) byte num_concurrent_subs
1 1 "09/01/10" "09/01/10" "09/01/11" 1
1 1 "10/01/10" "09/01/10" "09/01/11" 2
1 1 "11/01/10" "09/01/10" "09/01/11" 2
1 2 "10/01/10" "10/01/10" "09/01/11" 2
1 2 "11/01/10" "10/01/10" "09/01/11" 2
1 3 "11/01/14" "09/01/14" "." 1
1 3 "11/01/16" "09/01/14" "." 1
1 4 "11/01/15" "10/01/15" "11/01/15" 3
1 5 "11/01/15" "10/01/15" "11/01/15" 3
end
* should always have an observation identifier
gen obsid = _n
* convert string to Stata numeric dates
gen odate = daily(obs_date,"MD20Y")
gen substart = daily(sub_start_date,"MD20Y")
gen subend = daily(sub_end_date,"MD20Y")
format %td odate substart subend
save "main_data.dta", replace
* reduce to subscription start and end date per person
bysort person subscription substart subend: keep if _n == 1
keep person substart subend
* missing values will exclude obs so use a date in the future
replace subend = mdy(1,1,2099) if mi(subend)
* pair each subscription with an obs date
rangejoin odate substart subend using "main_data.dta", by(person)
* the number of current subscriptions is the number of pairings
bysort obsid: gen current = _N
* return to original obs
by obsid: keep if _n == 1
sort obsid
drop substart subend
rename (substart_U subend_U) (substart subend)

How to collapse numbers with same identifier but different date, but preserve the date of first observation for each identifier

I have a dataset that can be simplified in the following format:
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
In the dataset, there is Date, ID, VarA, and VarB. Each ID represents a unique set of transactions. I want to collapse (sum) VarA VarB, by(Date) in Stata. However, I want to keep the date of the first observation for each ID number.
Essentially, I want the above dataset to become the following:
+--------------------------------+
| Date ID VarA VarB |
|--------------------------------|
| 12jan2010 5 21 42 |
| 12jan2010 6 41 17 |
| 15jan2010 10 7 68 |
+--------------------------------+
The observations for ID 6 are dated 12jan2010, 17jan2010, and 19jan2010, so I want to collapse (sum) VarA VarB over these three observations. I want to keep the date 12jan2010 because it is the date of the first observation. The other two observations are dropped.
I know it might be possible to collapse by ID first and then merge with the original dataset and then subset. I was wondering if there is an easier way to make this work. Thanks!
collapse allows you to calculate a variety of statistics, so you can convert your string date into a numerical date, then take the minimum of the numerical date to get the first occurrence.
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY")
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
In response to the comment: You can generate the formatted date for only observations where VarA is > 0 (and not missing). (Assuming that, per your comment, VarA & VarB always have the same sign.)
// now assume ID 6 has an earliest date of 17jan2005 (obs.4)
// but you want to return your 'first date' as the
// first date where varA & varB are both positive
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2005" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY") if VarA > 0 & !missing(VarA)
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+

Compute year on year growth of GDP for each quarter

I want to compute in Stata year on year growth rate of the GDP for each quarter. Basically, I want to compute : (gdp_q1y1980-gdp_q1y1979)/gdp_q1y1979.
// create some silly example data
clear
set obs 10
gen time = _n
format time %tq
gen gdp = _n^2*100
// do the computation
tsset time
gen growth = S4.gdp / L4.gdp
// admire the result
list
For more information see here.
I prefer @Maarten Buis's solution, but note that you could also use subscripting:
sort time
gen growth = (gdp / gdp[_n-4]) - 1
Run help subscripting for details (or http://www.stata.com/help.cgi?subscripting).
Note that extra care must be taken if, for example, there is a gap in your time series:
. clear all
. set more off
.
. // create some silly example data
. set obs 15
obs was 0, now 15
. gen time = _n
. format time %tq
. gen gdp = _n^2*100
.
. // create a gap deleting 1962q4
. drop in 11
(1 observation deleted)
.
. // using -tsset-
. tsset time
time variable: time, 1960q2 to 1963q4, but with a gap
delta: 1 quarter
. gen growth = S4.gdp / L4.gdp
(5 missing values generated)
.
. // subscripting
. sort time
. gen growth2 = (gdp / gdp[_n-4]) - 1
(4 missing values generated)
.
. list, separator(0)
+--------------------------------------+
| time gdp growth growth2 |
|--------------------------------------|
1. | 1960q2 100 . . |
2. | 1960q3 400 . . |
3. | 1960q4 900 . . |
4. | 1961q1 1600 . . |
5. | 1961q2 2500 24 24 |
6. | 1961q3 3600 8 8 |
7. | 1961q4 4900 4.444445 4.444445 |
8. | 1962q1 6400 3 3 |
9. | 1962q2 8100 2.24 2.24 |
10. | 1962q3 10000 1.777778 1.777778 |
11. | 1963q1 14400 1.25 1.938776 |
12. | 1963q2 16900 1.08642 1.640625 |
13. | 1963q3 19600 .96 1.419753 |
14. | 1963q4 22500 . 1.25 |
+--------------------------------------+
The results for the solution with subscripts (variable growth2) are messed up once the gap begins (1963q1). A good reason, I think, to prefer tsset.
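If you nevertheless want to use subscripts, one possible safeguard is to fill the gap with tsfill first, so that four rows back is again four quarters back. The following is a sketch on the same silly data; growth3 is a name made up for this illustration.

```stata
clear
set obs 15
gen time = _n
format time %tq
gen gdp = _n^2*100
drop in 11                      // create the 1962q4 gap

tsset time
gen growth = S4.gdp / L4.gdp    // reference result from time-series operators

tsfill                          // add an empty row for 1962q4
sort time
gen growth3 = (gdp / gdp[_n-4]) - 1
list, separator(0)
```

After tsfill, growth3 agrees with growth: both are missing for the filled-in quarter and for 1963q4, whose comparison quarter has no data.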