I have a panel survey that tracks the same workers with the latest year being 2019. I am interested in creating variables that capture the duration of no work for each individual, by taking the difference in years between the variables "left_firstjob_yr" and "start_secondjob_yr", which I did generate with:
gen no_work_duration_1 = (start_secondjob_yr - left_firstjob_yr)
where "no_work_duration_1" refers to the duration of no work between the date an individual left their first job and the date where they started their second job.
However, this approach does not account for workers who left their first job but never worked again (or left the workforce); they have missing values (".") in the "start_secondjob_yr" column.
clear
input int(start_firstjob_yr left_firstjob_yr start_secondjob_yr left_secondjob_yr)
2014 2015 2017 2019
2014 . . .
2011 2014 . .
2003 2008 2011 .
2007 2009 2012 2014
end
Ideally, I am trying to have my dataset looking as follows:
clear
input int(start_firstjob_yr left_firstjob_yr) byte no_work_duration_1 int(start_secondjob_yr left_secondjob_yr) byte no_work_duration_2
2014 2015 2 2017 2019 .
2014 . . . . .
2011 2014 5 . . .
2003 2008 3 2011 . .
2007 2009 3 2012 2014 2
end
It has been clarified (see the comments on the original question) that, for workers with no second job, the no-work duration should be the difference between the year they left their first job and 2019.
I have taken the liberty of using some shorter variable names.
clear
input int(j1_start j1_end j2_start j2_end)
2014 2015 2017 2019
2014 . . .
2011 2014 . .
2003 2008 2011 .
2007 2009 2012 2014
end
* No Work Duration ("nwd")
gen nwd = j2_start - j1_end
* In cases of no second job:
replace nwd = 2019 - j1_end if missing(j2_start)
list
     +---------------------------------------------+
     | j1_start   j1_end   j2_start   j2_end   nwd |
     |---------------------------------------------|
  1. |     2014     2015       2017     2019     2 |
  2. |     2014        .          .        .     . |
  3. |     2011     2014          .        .     5 |
  4. |     2003     2008       2011        .     3 |
  5. |     2007     2009       2012     2014     3 |
     +---------------------------------------------+
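The two steps above can also be collapsed into a single cond() call. What follows is only a sketch: the local macro lastyear and the variable name nwd_alt are illustrative names that do not appear in the original, and 2019 is taken from the question as the final survey year.

```stata
clear
input int(j1_start j1_end j2_start j2_end)
2014 2015 2017 2019
2014 . . .
2011 2014 . .
2003 2008 2011 .
2007 2009 2012 2014
end

* Hypothetical local holding the last survey year (2019 in the question)
local lastyear 2019

* Two-step version, as in the answer above
gen nwd = j2_start - j1_end
replace nwd = `lastyear' - j1_end if missing(j2_start)

* One-step equivalent: measure up to lastyear when there is no second job
gen nwd_alt = cond(missing(j2_start), `lastyear' - j1_end, j2_start - j1_end)

* Both versions should agree, including where the result is missing
assert nwd_alt == nwd
```

A worker who never left the first job (row 2) still gets a missing duration, because j1_end is itself missing.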
I am trying to construct a balanced dataset in Stata using xtbalance command with range() option. I have data for the years 2010-2013 and 2016. But the survey was not conducted in the years 2014 and 2015. Running xtbalance, range(2010 2016) fails as xtbalance does not realize that 2014 and 2015 are not there, and basically no observations are left in the constructed panel dataset.
How should I implement this? Is it possible to construct a balanced dataset for the years 2010-2013 and 2016? What command in Stata would allow me to do this?
This is more statistics than Stata, but here goes.
If you don't have data for 2014 and 2015, how is Stata expected to find them or estimate them? Only by imputation or interpolation, which would be a massive stretch.
You can get a balanced dataset with tsfill, but it's hard to see any point to that.
You could relabel years 2010-13 and 2016 as say "waves" 1 2 3 4 5, but that ignores the crucial gap.
Best to see and state the problem for what it is, a gap beyond your control.
xtbalance is from SSC, as you are asked to explain. (I see that it includes without permission or documentation code that I wrote, but that's a small deal.)
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
end
tsset id year
tsfill
list, sepby(id)
     +----------------------+
     | id   year   whatever |
     |----------------------|
  1. |  1   2010          1 |
  2. |  1   2011          2 |
  3. |  1   2012          3 |
  4. |  1   2013          4 |
  5. |  1   2014          . |
  6. |  1   2015          . |
  7. |  1   2016          5 |
     |----------------------|
  8. |  2   2010          7 |
  9. |  2   2011          8 |
 10. |  2   2012          9 |
 11. |  2   2013         10 |
 12. |  2   2014          . |
 13. |  2   2015          . |
 14. |  2   2016         12 |
     +----------------------+
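For completeness, the relabelling idea mentioned above (treating the observed years as waves) could be sketched as follows. The variable names wave and nobs, the local nwaves, and the extra panel with id 3 are illustrative assumptions, and the sketch assumes each id appears at most once per year. As noted, this hides the 2014-2015 gap rather than solving it.

```stata
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
3 2010 4
3 2011 5
3 2012 6
end

* Relabel the observed survey years as consecutive waves 1, 2, ...
egen wave = group(year)

* Count the waves that exist in the data
summarize wave, meanonly
local nwaves = r(max)

* Keep only panels observed in every wave (drops id 3 here)
bysort id: gen nobs = _N
keep if nobs == `nwaves'

* Declare the panel on wave rather than on year
xtset id wave
```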
Say I have a bi-yearly panel only with observations at odd years, such as
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
and I would like to fill in the missing even years. Here years 2012 and 2014 are missing for all ids.
input id year var
1 2011 23
1 2012 .
1 2013 12
1 2014 .
1 2015 11
2 2011 44
2 2012 .
2 2013 42
2 2014 .
2 2015 13
end
I had a look at help expand but I am unsure that's what I need, since it does not take the by prefix.
As background, I need to fill in the even years to be able to merge with another panel dataset conducted in even years only.
You can set the panel id as id and the time variable as year and use tsfill:
clear
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
xtset id year
tsfill
If the minimum and maximum years are not the same across panels, you could look at the full option of tsfill.
. list
     +-----------------+
     | id   year   var |
     |-----------------|
  1. |  1   2011    23 |
  2. |  1   2012     . |
  3. |  1   2013    12 |
  4. |  1   2014     . |
  5. |  1   2015    11 |
     |-----------------|
  6. |  2   2011    44 |
  7. |  2   2012     . |
  8. |  2   2013    42 |
  9. |  2   2014     . |
 10. |  2   2015    13 |
     +-----------------+
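To illustrate the full option mentioned above, here is a small sketch on a made-up variant of the data in which the second panel only starts in 2013. With tsfill, full each panel is filled from the overall minimum to the overall maximum year, so id 2 also gains rows for 2011 and 2012.

```stata
clear
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2013 42
2 2015 13
end

xtset id year

* Fill every panel over the overall range 2011-2015
tsfill, full

list, sepby(id)
```

Without the full option, id 2 would only be filled between its own first and last years (2013 to 2015).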
I have the following dataset
A B begin_yr end_yr
asset brown 2007 2010
asset blue 2008 2008
basics caramel 2015 2015
cows dork 2004 2006
I want A and B to have rows for each year represented.
I expanded for each year:
gen x = end_yr - begin_yr
expand x +1
This gives me the following:
A B begin_yr end_yr x
asset brown 2007 2010 3
asset brown 2007 2010 3
asset brown 2007 2010 3
asset brown 2007 2010 3
asset blue 2008 2008 0
basics caramel 2015 2015 0
cows dork 2004 2006 2
Ultimately, I want the following dataset:
A B begin_yr end_yr x year
asset brown 2007 2010 3 2007
asset brown 2007 2010 3 2008
asset brown 2007 2010 3 2009
asset brown 2007 2010 3 2010
asset blue 2008 2008 0 2008
basics caramel 2015 2015 0 2015
cows dork 2004 2006 2 2004
cows dork 2004 2006 2 2005
cows dork 2004 2006 2 2006
This is what I have so far:
gen year = begin_yr if begin_yr!=end_yr
How do I populate the rest of the variable year?
Here's a twist building on @Pearly Spencer's code:
clear
input strL A strL B begin_yr end_yr
asset brown 2007 2010
basics caramel 2015 2015
cows dork 2004 2006
end
gen toexpand = end - begin + 1
expand toexpand
bysort A : gen year = begin + _n - 1
list, sepby(A)
     +--------------------------------------------------------+
     |      A         B   begin_yr   end_yr   toexpand   year |
     |--------------------------------------------------------|
  1. |  asset     brown       2007     2010          4   2007 |
  2. |  asset     brown       2007     2010          4   2008 |
  3. |  asset     brown       2007     2010          4   2009 |
  4. |  asset     brown       2007     2010          4   2010 |
     |--------------------------------------------------------|
  5. | basics   caramel       2015     2015          1   2015 |
     |--------------------------------------------------------|
  6. |   cows      dork       2004     2006          3   2004 |
  7. |   cows      dork       2004     2006          3   2005 |
  8. |   cows      dork       2004     2006          3   2006 |
     +--------------------------------------------------------+
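The example above groups on A, which works because each value of A appears once in that data. The question's data also contains a second "asset" row (asset blue), so a variant that numbers the original rows first may be safer. This is a sketch only; the identifier id and the str10 storage types are assumptions made for illustration.

```stata
clear
input str10 A str10 B begin_yr end_yr
asset brown 2007 2010
asset blue 2008 2008
basics caramel 2015 2015
cows dork 2004 2006
end

* Hypothetical row identifier, created before expanding
gen long id = _n

* One copy of each row per year in its range
expand end_yr - begin_yr + 1

* Number the copies within each original row
bysort id: gen year = begin_yr + _n - 1

list, sepby(id)
```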
Nothing against tsset or tsfill but neither is needed for this.
The following works for me:
clear
input strL A strL B begin_yr end_yr
asset brown 2007 2010
basics caramel 2015 2015
cows dork 2004 2006
end
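generate id = _n
expand 2
clonevar year = begin_yr
bysort id: replace year = end_yr[2] if _n == _N
* Where begin_yr equals end_yr both copies share the same year;
* drop one copy so that tsset sees unique id-year pairs
bysort id: drop if _n == 2 & begin_yr == end_yr
tsset id year
tsfill
* Copy the original values into the rows created by tsfill
foreach var in A B begin_yr end_yr {
    bysort id (year): replace `var' = `var'[1]
}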
list
     +--------------------------------------------------+
     |      A         B   begin_yr   end_yr   id   year |
     |--------------------------------------------------|
  1. |  asset     brown       2007     2010    1   2007 |
  2. |  asset     brown       2007     2010    1   2008 |
  3. |  asset     brown       2007     2010    1   2009 |
  4. |  asset     brown       2007     2010    1   2010 |
  5. | basics   caramel       2015     2015    2   2015 |
     |--------------------------------------------------|
  6. |   cows      dork       2004     2006    3   2004 |
  7. |   cows      dork       2004     2006    3   2005 |
  8. |   cows      dork       2004     2006    3   2006 |
     +--------------------------------------------------+
I have a panel dataset from 2006 to 2012. I generated a new variable, entry, that takes the value 1 when a firm enters a country. For instance, if a firm has a missing value (.) for its sales at time t, entry is 0; if at t+1 it enters the country (in other words, it has a value for its sales), entry is 1. The command that worked for this is as follows:
egen firm_id=group(firm country)
by firm_id (year), sort: gen byte entry = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since my data start from 2006 I excluded the observations for this year with the command:
bysort firm (year) : replace entry = 0 if year == 2006
However what I want is instead of having 0 values,
to have missing values for the subsequent years after its entry (e.g. at t+2 or t+3).
The same I applied for the exit but I changed the sort order of year:
gen nyear = -year
by firm_id (nyear), sort: gen byte exit = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since the last observation year in my data is 2012, I excluded those observations:
bysort firm (year) : replace exit = 0 if year == 2012
Again here what I want is instead of having 0 values,
to have missing values for the subsequent years after its exit (e.g. at t+2 or t+3).
As I understand it, the variable sales is missing when there are none and positive otherwise.
You want indicators for a year being the first and last years of sales for a firm in a country.
I think this gets you most of the way. First, we need example data!
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(firm_id year sales)
1 2006 .
1 2007 .
1 2008 42
1 2009 42
1 2010 42
1 2011 .
1 2012 .
2 2006 .
2 2007 666
2 2008 666
2 2009 .
2 2010 .
2 2011 .
2 2012 .
end
The first and last dates are the minimum and maximum dates, conditional on there being sales.
egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last = max(cond(sales < ., year, .)), by(firm_id)
For discussion of technique, see section 9 of this paper. Then (1, .) indicators follow directly
generate isfirst = cond(year == first, 1, .)
generate islast = cond(year == last, 1, .)
list, sepby(firm_id)
     +----------------------------------------------------------+
     | firm_id   year   sales   first   last   isfirst   islast |
     |----------------------------------------------------------|
  1. |       1   2006       .    2008   2010         .        . |
  2. |       1   2007       .    2008   2010         .        . |
  3. |       1   2008      42    2008   2010         1        . |
  4. |       1   2009      42    2008   2010         .        . |
  5. |       1   2010      42    2008   2010         .        1 |
  6. |       1   2011       .    2008   2010         .        . |
  7. |       1   2012       .    2008   2010         .        . |
     |----------------------------------------------------------|
  8. |       2   2006       .    2007   2008         .        . |
  9. |       2   2007     666    2007   2008         1        . |
 10. |       2   2008     666    2007   2008         .        1 |
 11. |       2   2009       .    2007   2008         .        . |
 12. |       2   2010       .    2007   2008         .        . |
 13. |       2   2011       .    2007   2008         .        . |
 14. |       2   2012       .    2007   2008         .        . |
     +----------------------------------------------------------+
I have not done anything different for 2006 or 2012. You could just build special rules into the cond() syntax.
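One possible form of such a special rule is sketched below. It is an assumption about what is wanted, not something stated in the question: a first year of sales equal to 2006, or a last year equal to 2012, is treated as unknown (the firm may have been active outside the sample window), so no indicator is set. The second firm in the example data is made up to show that case.

```stata
clear
input float(firm_id year sales)
1 2006 .
1 2007 .
1 2008 42
1 2009 42
1 2010 42
1 2011 .
1 2012 .
2 2006 666
2 2007 666
2 2008 666
2 2009 666
2 2010 666
2 2011 666
2 2012 666
end

egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last = max(cond(sales < ., year, .)), by(firm_id)

* Assumed rule: no entry indicator if sales already exist in 2006,
* and no exit indicator if sales still exist in 2012
generate isfirst = cond(year == first & first > 2006, 1, .)
generate islast = cond(year == last & last < 2012, 1, .)

list, sepby(firm_id)
```

Here firm 1 is flagged in 2008 (entry) and 2010 (exit), while firm 2 gets no indicators at all.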
I'm using Stata. I have a dataset of multiple firms and their banks within a given year, for multiple years. Since firms often have more than one bank, there are multiple observations per firm-year. I have a variable "bank_exityear" which contains the last year a bank is in the sample. I would like to create a variable that, for each firm within a given year, contains the minimum of "bank_exityear" from the previous year (for the same firm).
An example data-set is attached here:
The variable I'd like to create is the bold "want". The data starts in 2008.
What would be the best way to create this variable?
Here's a solution using rangestat (from SSC). To install it, type in Stata's Command window:
ssc install rangestat
For the problem at hand, this requires finding the minimum bank_exityear across all observations of the same firmid whose year is one less than the year of the current observation:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2015
end
rangestat (min) bank_exityear, interval(year -1 -1) by(firmid)
list
and the results:
. list, sepby(firmid)
     +-----------------------------------------------------+
     | year   firmid   bankid   bank_e~r   want   bank_e~n |
     |-----------------------------------------------------|
  1. | 2008        1        1       2008      .          . |
  2. | 2008        1        2       2015      .          . |
  3. | 2009        1        2       2015   2008       2008 |
  4. | 2009        1        3       2015   2008       2008 |
  5. | 2010        1        2       2015   2015       2015 |
  6. | 2010        1        3       2015   2015       2015 |
     +-----------------------------------------------------+
This sort of strategy might do the trick:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2015
end
tempfile min_year
preserve
collapse (min) want2 = bank_exityear, by(firmid year)
save `min_year'
restore
replace year = year - 1
merge m:1 firmid year using "`min_year'", nogen keep(master match)
replace year = year + 1
This assumes that there are no gaps in year.
Your question is a little unclear, but I believe some combination of
bysort bank_id (year) : gen lag_exit = bank_exit_year[_n-1]
bysort bank_id : egen min_var = min(lag_exit)
should work.