(Stata) Find values from earlier years with multiple observations within a year

I'm using Stata. I have a dataset of multiple firms and their banks within a given year, for multiple years. Since firms often have more than one bank, there are multiple observations per firm-year. I have a variable "bank_exityear" which contains the last year a bank is in the sample. I would like to create a variable that, for each firm in a given year, contains the minimum of "bank_exityear" across that same firm's observations in the previous year.
An example data-set is attached here:
The variable I'd like to create is the bold "want". The data starts in 2008.
What would be the best way to create this variable?

Here's a solution using rangestat (from SSC). To install it, type in Stata's Command window:
ssc install rangestat
For the problem at hand, this requires finding the minimum bank_exityear across all observations of the same firmid whose year is one less than the year of the current observation:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2105
end
rangestat (min) bank_exityear, interval(year -1 -1) by(firmid)
list, sepby(firmid)
and the results:
. list, sepby(firmid)
+-----------------------------------------------------+
| year firmid bankid bank_e~r want bank_e~n |
|-----------------------------------------------------|
1. | 2008 1 1 2008 . . |
2. | 2008 1 2 2015 . . |
3. | 2009 1 2 2015 2008 2008 |
4. | 2009 1 3 2015 2008 2008 |
5. | 2010 1 2 2015 2015 2015 |
6. | 2010 1 3 2015 2105 2015 |
+-----------------------------------------------------+
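If a different look-back window or a more readable variable name is wanted, the same call can probably be adapted. The sketch below assumes the example data above are in memory and rangestat is installed; prev_min and prev2_min are illustrative names, not part of the original answer.

```stata
* Same statistic, but with an explicit name for the result (illustrative name)
rangestat (min) prev_min = bank_exityear, interval(year -1 -1) by(firmid)

* Minimum over the two preceding years instead of only the previous one
rangestat (min) prev2_min = bank_exityear, interval(year -2 -1) by(firmid)

list, sepby(firmid)
```

The two numbers in interval() are offsets added to each observation's own year, so -1 -1 selects exactly the previous year and -2 -1 selects the two years before the current one.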

This sort of strategy might do the trick:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2105
end
tempfile min_year
preserve
collapse (min) want2 = bank_exityear, by(firmid year)
save `min_year'
restore
replace year = year - 1
merge m:1 firmid year using "`min_year'", nogen keep(master match)
replace year = year + 1
This assumes that there are no gaps in year.
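For comparison, here is a sketch that uses only built-in commands and no temporary file. It assumes the same variable names as the example; min_exit and prev_min are illustrative names. It leaves the result missing whenever a firm has no observations in the immediately preceding year.

```stata
clear
input year firmid bankid bank_exityear
2008 1 1 2008
2008 1 2 2015
2009 1 2 2015
2009 1 3 2015
2010 1 2 2015
2010 1 3 2015
end

* Minimum exit year within each firm-year
bysort firmid year: egen min_exit = min(bank_exityear)

* The first observation of each firm-year looks back at the last observation
* of the preceding block, but only if that block is exactly one year earlier
bysort firmid (year): gen prev_min = min_exit[_n-1] if year[_n-1] == year - 1

* Copy that value to the other observations of the same firm-year
* (missing values sort last, so prev_min[1] is nonmissing whenever one exists)
bysort firmid year (prev_min): replace prev_min = prev_min[1]

list, sepby(firmid year)
```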

Your question is a little unclear, but I believe some combination of
bysort bankid (year) : gen lag_exit = bank_exityear[_n-1]
bysort bankid : egen min_var = min(lag_exit)
should work.

Related

Unbalanced panel dataset

I am trying to construct a balanced dataset in Stata using the xtbalance command with the range() option. I have data for the years 2010-2013 and 2016, but the survey was not conducted in 2014 and 2015. Running xtbalance, range(2010 2016) fails because xtbalance does not recognize that 2014 and 2015 are absent for everyone, so essentially no observations are left in the constructed panel dataset.
How should I implement this? Is it possible to construct a balanced dataset for the years 2010-2013 and 2016? What command in Stata would allow me to do this?
This is more statistics than Stata, but here goes.
If you don't have data for 2014 and 2015, how is Stata expected to find them or estimate them? Only by imputation or interpolation, which would be a massive stretch.
You can get a balanced dataset with tsfill, but it's hard to see any point to that.
You could relabel years 2010-13 and 2016 as say "waves" 1 2 3 4 5, but that ignores the crucial gap.
Best to see and state the problem for what it is, a gap beyond your control.
xtbalance is from SSC, as you are asked to explain. (I see that it includes without permission or documentation code that I wrote, but that's a small deal.)
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
end
tsset id year
tsfill
list, sepby(id)
+----------------------+
| id year whatever |
|----------------------|
1. | 1 2010 1 |
2. | 1 2011 2 |
3. | 1 2012 3 |
4. | 1 2013 4 |
5. | 1 2014 . |
6. | 1 2015 . |
7. | 1 2016 5 |
|----------------------|
8. | 2 2010 7 |
9. | 2 2011 8 |
10. | 2 2012 9 |
11. | 2 2013 10 |
12. | 2 2014 . |
13. | 2 2015 . |
14. | 2 2016 12 |
+----------------------+
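The relabelling idea mentioned above could be sketched as follows, assuming one observation per id-year and the example data as entered (before tsfill). The variable name wave is illustrative. As already noted, this treats 2013 to 2016 as a single step and so ignores the gap.

```stata
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
end

* Number the survey years that actually exist: 2010 -> 1, ..., 2013 -> 4, 2016 -> 5
egen wave = group(year)

* Keep only panels observed in all five waves (assumes id-year is unique)
isid id year
bysort id: keep if _N == 5

xtset id wave
```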

Computing Unemployment duration (Stata)

I have a panel survey that tracks the same workers with the latest year being 2019. I am interested in creating variables that capture the duration of no work for each individual, by taking the difference in years between the variables "left_firstjob_yr" and "start_secondjob_yr", which I did generate with:
gen no_work_duration_1 = (start_secondjob_yr - left_firstjob_yr)
where "no_work_duration_1" refers to the duration of no work between the date an individual left their first job and the date they started their second job.
However, one problem with my approach above is that it does not account for workers who left their first job but never worked again (or left the workforce); they have missing "." values in the "start_secondjob_yr" column.
input int(start_firstjob_yr left_firstjob_yr start_secondjob_yr left_secondjob_yr)
2014 2015 2017 2019
2014 . . .
2011 2014 . .
2003 2008 2011 .
2007 2009 2012 2014
end
Ideally, I am trying to have my dataset looking as follows:
clear
input int(start_firstjob_yr left_firstjob_yr) byte no_work_duration_1 int(start_secondjob_yr left_secondjob_yr) byte no_work_duration_2
2014 2015 2 2017 2019 .
2014 . . . . .
2011 2014 5 . . .
2003 2008 3 2011 . .
2007 2009 3 2012 2014 2
end
It has been clarified (see comments on the original question) that for workers
with no second job, no_work_duration should be the difference between the year
they left their first job and 2019.
I have taken the liberty of using some shorter variable names.
clear
input int(j1_start j1_end j2_start j2_end)
2014 2015 2017 2019
2014 . . .
2011 2014 . .
2003 2008 2011 .
2007 2009 2012 2014
end
* No Work Duration ("nwd")
gen nwd = j2_start - j1_end
* In cases of no second job:
replace nwd = 2019 - j1_end if missing(j2_start)
list
+---------------------------------------------+
| j1_start j1_end j2_start j2_end nwd |
|---------------------------------------------|
1. | 2014 2015 2017 2019 2 |
2. | 2014 . . . . |
3. | 2011 2014 . . 5 |
4. | 2003 2008 2011 . 3 |
5. | 2007 2009 2012 2014 3 |
+---------------------------------------------+
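The two statements can also be folded into one with cond(). This is only a sketch; it assumes the example data above are in memory and that nwd has already been created as shown, and nwd_alt is an illustrative name.

```stata
* If there is no second job, measure up to 2019; otherwise up to the second job's start
gen nwd_alt = cond(missing(j2_start), 2019 - j1_end, j2_start - j1_end)

* Check that both approaches agree (missing equals missing in this comparison)
assert nwd_alt == nwd
```

When j1_end is itself missing, as in the second observation, both versions return missing.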

Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep only the individuals corresponding to the Id 1 and 2 because they are in the same firm in both of the years in the sample and Id 3 and 4 for 2010.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?
Regards,
bysort Id (Firm_id) : keep if Firm_id[1] == Firm_id[_N]
See FAQ here.
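The line above keeps individuals who are in a single firm throughout. If the goal is exactly the output listed in the question (keep a person-year only when at least one other person is in the same firm in the same year), a possible alternative is to count observations per firm-year. This sketch assumes each Id appears at most once per firm-year.

```stata
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end

* Keep observations only where the firm-year contains more than one person
bysort Firm_id Year: keep if _N > 1

sort Id Year
list, sepby(Id)
```

With the example data this drops Id 3 in 2011 (alone in firm 22) and Id 4 in 2011 (alone in firm 20).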

First and Last Occurrences in Stata

I have a panel dataset from 2006 to 2012. I generated a new variable entry that takes the value 1 in the year a firm enters a country. For instance, if a firm has a missing value (.) for its sales at time t, entry is 0; if at t+1 it enters the country (in other words, has a value for its sales), entry is 1. The command I used successfully for this is as follows:
egen firm_id=group(firm country)
by firm_id (year), sort: gen byte entry = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since my data start from 2006 I excluded the observations for this year with the command:
bysort firm (year) : replace entry = 0 if year == 2006
However what I want is instead of having 0 values,
to have missing values for the subsequent years after its entry (e.g. at t+2 or t+3).
The same I applied for the exit but I changed the sort order of year:
gen nyear = -year
by firm_id (nyear), sort: gen byte exit = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
since the last observation year in my data is 2012 I excluded those observations:
bysort firm (year) : replace exit = 0 if year == 2012
Again here what I want is instead of having 0 values,
to have missing values for the subsequent years after its exit (e.g. at t+2 or t+3).
As I understand it, the variable sales is missing when there are no sales and positive otherwise.
You want indicators for a year being the first and last years of sales for a firm in a country.
I think this gets you most of the way. First, we need example data!
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(firm_id year sales)
1 2006 .
1 2007 .
1 2008 42
1 2009 42
1 2010 42
1 2011 .
1 2012 .
2 2006 .
2 2007 666
2 2008 666
2 2009 .
2 2010 .
2 2011 .
2 2012 .
end
The first and last dates are the minimum and maximum dates, conditional on there being sales.
egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last = max(cond(sales < ., year, .)), by(firm_id)
For discussion of the technique, see section 9 of this paper. Then (1, .) indicators follow directly:
generate isfirst = cond(year == first, 1, .)
generate islast = cond(year == last, 1, .)
list, sepby(firm_id)
+----------------------------------------------------------+
| firm_id year sales first last isfirst islast |
|----------------------------------------------------------|
1. | 1 2006 . 2008 2010 . . |
2. | 1 2007 . 2008 2010 . . |
3. | 1 2008 42 2008 2010 1 . |
4. | 1 2009 42 2008 2010 . . |
5. | 1 2010 42 2008 2010 . 1 |
6. | 1 2011 . 2008 2010 . . |
7. | 1 2012 . 2008 2010 . . |
|----------------------------------------------------------|
8. | 2 2006 . 2007 2008 . . |
9. | 2 2007 666 2007 2008 1 . |
10. | 2 2008 666 2007 2008 . 1 |
11. | 2 2009 . 2007 2008 . . |
12. | 2 2010 . 2007 2008 . . |
13. | 2 2011 . 2007 2008 . . |
14. | 2 2012 . 2007 2008 . . |
+----------------------------------------------------------+
I have not done anything different for 2006 or 2012. You could just build special rules into the cond() syntax.
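One way such special rules might look is sketched below. It assumes, following the question, that a first sale in 2006 should not count as an entry and a last sale in 2012 should not count as an exit, because the sample edges hide what happened before and after. The names isfirst2 and islast2 are illustrative, and the variables first and last are those created above.

```stata
* Entry indicator: ignore a first year that coincides with the start of the sample
generate isfirst2 = cond(year == first & year > 2006, 1, .)

* Exit indicator: ignore a last year that coincides with the end of the sample
generate islast2 = cond(year == last & year < 2012, 1, .)
```

In the example data no firm has sales in 2006 or 2012, so these indicators coincide with isfirst and islast there.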

Creating a flag using indexes

I'm looking to build flags for students who have repeated a grade, skipped a grade, or who have an unusual grade progression (e.g. 4th grade in 2008 and 7th grade in 2009). My data is unique at the student id-year-subject level and structured like this (albeit with more variables):
id year subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
This is the code that I've used:
sort id year grade
gen repeat_flag = .
replace repeat_flag = 1 if year!=year[_n+1] & grade==grade[_n+1] ///
& subject!=subject[_n+1] & id==id[_n+1]
replace repeat_flag = 0 if repeat_flag==.
One problem is that there are a lot of students who took a test in, say, 6th grade, didn't take one in 7th, and then took one in 8th grade. This varies across years and school districts, as certain school districts adopted tests in different years for different grade levels. My code doesn't account for this.
Regardless though I think there must be more elegant ways to do this and as a side note I wanted to know if the use of indexes is appropriate for a problem like this. Thanks!
Edit
Included a sample of what my data looks like above in response to one of the comments below. If still not clear any feedback is welcomed.
What may seem anomalous are students progressing faster or more slowly in tested grade than the passage of time would imply. That's possibly just one line for the grunt work:
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
bysort id (year) : gen flag = (tested - tested[_n-1]) - (year - year[_n-1])
list if flag != 0 & flag < . , sepby(id)
+---------------------------------------+
| id year subject tested~e flag |
|---------------------------------------|
5. | 2 2012 r 7 2 |
+---------------------------------------+
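If separate indicators are wanted, the sign of flag could be used. This is a sketch under the assumption that progressing more slowly than the calendar is consistent with a repeated grade and progressing faster with a skipped grade; slower and faster are illustrative names.

```stata
* flag < 0: tested grade advanced less than the number of years elapsed
gen byte slower = flag < 0 if flag < .

* flag > 0: tested grade advanced more than the number of years elapsed
gen byte faster = flag > 0 if flag < .

list id year tested_grade flag slower faster, sepby(id)
```

Both indicators are left missing for each student's first observation, where there is no earlier year to compare with.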