First and Last Occurrences in Stata

I have a panel dataset covering 2006 to 2012. I generated a new variable, entry, that takes the value 1 in the year a firm enters a country. For instance, if a firm has a missing value (.) for sales at time t, entry is 0; if at t+1 it enters the country, in other words has a nonmissing value for sales, entry is 1. The command I used successfully for this is as follows:
egen firm_id=group(firm country)
by firm_id (year), sort: gen byte entry = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since my data start from 2006 I excluded the observations for this year with the command:
bysort firm (year) : replace entry = 0 if year == 2006
However, instead of 0 values, I want missing values for the years after entry (e.g. at t+2 or t+3).
I applied the same approach for exit, but reversed the sort order of year:
gen nyear = -year
by firm_id (nyear), sort: gen byte exit = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since the last observation year in my data is 2012, I excluded those observations:
bysort firm (year) : replace exit = 0 if year == 2012
Again, instead of 0 values, I want missing values for the years after exit (e.g. at t+2 or t+3).

As I understand it, the variable sales is missing when there are no sales and positive otherwise.
You want indicators for a year being the first and last years of sales for a firm in a country.
I think this gets you most of the way. First, we need example data!
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(firm_id year sales)
1 2006 .
1 2007 .
1 2008 42
1 2009 42
1 2010 42
1 2011 .
1 2012 .
2 2006 .
2 2007 666
2 2008 666
2 2009 .
2 2010 .
2 2011 .
2 2012 .
end
The first and last dates are the minimum and maximum dates, conditional on there being sales.
egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last = max(cond(sales < ., year, .)), by(firm_id)
For discussion of technique, see section 9 of this paper. Then (1, .) indicators follow directly:
generate isfirst = cond(year == first, 1, .)
generate islast = cond(year == last, 1, .)
list, sepby(firm_id)
     +----------------------------------------------------------+
     | firm_id   year   sales   first   last   isfirst   islast |
     |----------------------------------------------------------|
  1. |       1   2006       .    2008   2010         .        . |
  2. |       1   2007       .    2008   2010         .        . |
  3. |       1   2008      42    2008   2010         1        . |
  4. |       1   2009      42    2008   2010         .        . |
  5. |       1   2010      42    2008   2010         .        1 |
  6. |       1   2011       .    2008   2010         .        . |
  7. |       1   2012       .    2008   2010         .        . |
     |----------------------------------------------------------|
  8. |       2   2006       .    2007   2008         .        . |
  9. |       2   2007     666    2007   2008         1        . |
 10. |       2   2008     666    2007   2008         .        1 |
 11. |       2   2009       .    2007   2008         .        . |
 12. |       2   2010       .    2007   2008         .        . |
 13. |       2   2011       .    2007   2008         .        . |
 14. |       2   2012       .    2007   2008         .        . |
     +----------------------------------------------------------+
I have not done anything different for 2006 or 2012. You could just build special rules into the cond() syntax.
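One way such special rules might look, as a minimal sketch. It assumes that a first sale in 2006 or a last sale in 2012 should not count as an entry or an exit, because the panel is truncated at those years. The names isentry and isexit and the small dataset are illustrative, not from the question.

```stata
clear
input float(firm_id year sales)
1 2006 5
1 2007 5
1 2008 .
2 2006 .
2 2007 9
2 2012 9
end

* First and last years with nonmissing sales, per firm
egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last  = max(cond(sales < ., year, .)), by(firm_id)

* Assumption: ignore a "first" year of 2006 and a "last" year of 2012
generate isentry = cond(year == first & first > 2006, 1, .)
generate isexit  = cond(year == last  & last  < 2012, 1, .)

list, sepby(firm_id)
```

Firm 1 already sells in 2006, so it gets no entry flag; firm 2 still sells in 2012, so it gets no exit flag.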

Related

Unbalanced panel dataset

I am trying to construct a balanced dataset in Stata using the xtbalance command with the range() option. I have data for the years 2010-2013 and 2016, but the survey was not conducted in 2014 and 2015. Running xtbalance, range(2010 2016) fails because xtbalance does not realize that 2014 and 2015 are absent, and essentially no observations are left in the constructed panel dataset.
How should I implement this? Is it possible to construct a balanced dataset for the years 2010-2013 and 2016? What command in Stata would allow me to do this?
This is more statistics than Stata, but here goes.
If you don't have data for 2014 and 2015, how is Stata expected to find them or estimate them? Only by imputation or interpolation, which would be a massive stretch.
You can get a balanced dataset with tsfill, but it's hard to see any point to that.
You could relabel years 2010-13 and 2016 as say "waves" 1 2 3 4 5, but that ignores the crucial gap.
Best to see and state the problem for what it is, a gap beyond your control.
xtbalance is from SSC, as you are asked to explain. (I see that it includes without permission or documentation code that I wrote, but that's a small deal.)
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
end
tsset id year
tsfill
list, sepby(id)
     +----------------------+
     | id   year   whatever |
     |----------------------|
  1. |  1   2010          1 |
  2. |  1   2011          2 |
  3. |  1   2012          3 |
  4. |  1   2013          4 |
  5. |  1   2014          . |
  6. |  1   2015          . |
  7. |  1   2016          5 |
     |----------------------|
  8. |  2   2010          7 |
  9. |  2   2011          8 |
 10. |  2   2012          9 |
 11. |  2   2013         10 |
 12. |  2   2014          . |
 13. |  2   2015          . |
 14. |  2   2016         12 |
     +----------------------+
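The relabelling of years as waves mentioned above could be sketched as follows. The variable name wave is an assumption for illustration, and, as already noted, this treats the step from 2013 to 2016 as if it were a single period.

```stata
clear
input float(id year whatever)
1 2010 1
1 2011 2
1 2012 3
1 2013 4
1 2016 5
2 2010 7
2 2011 8
2 2012 9
2 2013 10
2 2016 12
end

* Map the distinct years, in sorted order, to 1, 2, 3, 4, 5
egen wave = group(year)

* The panel is balanced in terms of wave, with no fill-in needed
xtset id wave
list, sepby(id)
```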

How to complete (fill) a panel dataset with a time variable that has a delta > 1?

Say I have a bi-yearly panel only with observations at odd years, such as
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
and I would like to fill in the missing even years. Here years 2012 and 2014 are missing for all ids.
input id year var
1 2011 23
1 2012 .
1 2013 12
1 2014 .
1 2015 11
2 2011 44
2 2012 .
2 2013 42
2 2014 .
2 2015 13
end
I had a look at help expand, but I am not sure that is what I need, since it does not take the by prefix.
As background, I need to fill in the even years to be able to merge with another panel dataset conducted in even years only.
You can set the panel id as id and the time variable as year and use tsfill:
clear
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
xtset id year
tsfill
If the minimum and maximum years are not constant across panels, you could look at the full option of tsfill.
. list
     +-----------------+
     | id   year   var |
     |-----------------|
  1. |  1   2011    23 |
  2. |  1   2012     . |
  3. |  1   2013    12 |
  4. |  1   2014     . |
  5. |  1   2015    11 |
     |-----------------|
  6. |  2   2011    44 |
  7. |  2   2012     . |
  8. |  2   2013    42 |
  9. |  2   2014     . |
 10. |  2   2015    13 |
     +-----------------+
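To illustrate the full option, here is a small sketch with made-up data in which the two panels cover different year ranges. With tsfill, full every panel is padded out to the overall range of year, here 2011 to 2015, rather than only to its own first and last year.

```stata
clear
input id year var
1 2011 23
1 2013 12
2 2013 42
2 2015 13
end

xtset id year

* Without full, id 1 would stop at 2013 and id 2 would start at 2013
tsfill, full

list, sepby(id)
```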

Computing Unemployment duration (Stata)

I have a panel survey that tracks the same workers with the latest year being 2019. I am interested in creating variables that capture the duration of no work for each individual, by taking the difference in years between the variables "left_firstjob_yr" and "start_secondjob_yr", which I did generate with:
gen no_work_duration_1 = (start_secondjob_yr  - left_firstjob_yr)
Here "no_work_duration_1" refers to the duration of no work between the year an individual left their first job and the year they started their second job.
However, one problem with my approach above is that it does not take into account workers who left their first job, but never worked again/left the workforce with missing "." values under the "start_secondjob_yr" column.
input int(start_firstjob_yr left_firstjob_yr start_secondjob_yr left_secondjob_yr)
2014 2015 2017 2019
2014 . . .
2011 2014 . .
2003 2008 2011 .
2007 2009 2012 2014
end
Ideally, I am trying to have my dataset looking as follows:
clear
input int(start_firstjob_yr left_firstjob_yr) byte no_work_duration_1 int(start_secondjob_yr left_secondjob_yr) byte no_work_duration_2
2014 2015 2 2017 2019 .
2014 . . . . .
2011 2014 5 . . .
2003 2008 3 2011 . .
2007 2009 3 2012 2014 2
end
It has been clarified (see comments on the original question) that for workers
with no second job, no_work_duration should be the difference between 2019
and the year they left their first job.
I have taken the liberty of using some shorter variable names.
clear
input int(j1_start j1_end j2_start j2_end)
2014 2015 2017 2019
2014 . . .
2011 2014 . .
2003 2008 2011 .
2007 2009 2012 2014
end
* No Work Duration ("nwd")
gen nwd = j2_start - j1_end
* In cases of no second job:
replace nwd = 2019 - j1_end if missing(j2_start)
list
     +---------------------------------------------+
     | j1_start   j1_end   j2_start   j2_end   nwd |
     |---------------------------------------------|
  1. |     2014     2015       2017     2019     2 |
  2. |     2014        .          .        .     . |
  3. |     2011     2014          .        .     5 |
  4. |     2003     2008       2011        .     3 |
  5. |     2007     2009       2012     2014     3 |
     +---------------------------------------------+

Trimming my panel dataset - filtering out observations meeting criterion if preceding ID meets the complementary criterion

I am working with a dataset that includes 118,979 observations over 9 wide variables in Stata 16.0. The most prominent variable is whether a company-observation over multiple dates reports either "GPS" or "EPS". These companies can report both a "GPS" observation in a datapoint, as well as an "EPS" observation in the following datapoint. Please refer to the data overview below for further visualisation.
Datasample:
clear
input str8 cusip8 str16 cname str4 measure double actual long anndats_act float(fyear tanalyst meanforcast UE)
"87482X10" "TALMER BANCORP" "EPS" 1.21 20118 2014 29 .8686207 .3930131
"87482X10" "TALMER BANCORP" "GPS" 1.02 20479 2015 34 .8576471 .1893004
end
I need to drop the GPS observations (over multiple dates) once an identifier (being cusip8 in the table above) has reported an EPS over multiple dates. That is, if a company has reported GPS as well as EPS in e.g. January 1st, 2010, I want to drop the GPS observation such that the EPS is kept.
If a company only reports a GPS, and does not report an EPS during a given date, I want to keep the GPS observation in my dataset.
The following works for me (adjust your variable names as required):
. clear
. input str10(company_id measure) month day year
company_id measure month day year
1. "Company A" "EPS" 1 1 2010
2. "Company A" "GPS" 1 1 2010
3. "Company A" "GPS" 1 1 2010
4. "Company A" "GPS" 1 2 2010
5. "Company B" "EPS" 1 2 2010
6. "Company B" "GPS" 1 1 2010
7. "Company C" "GPS" 1 4 2010
8. "Company C" "EPS" 1 4 2010
9. end
.
. gen date = mdy(month,day,year)
. format date %d
. drop month day year
.
. sort company_id date measure
.
. gen both = 0
. by company_id date: replace both = 1 if measure[1] == "EPS" & measure[2] == "GPS"
(5 real changes made)
.
. list, sepby(company_id)
     +----------------------------------------+
     | company~d   measure        date   both |
     |----------------------------------------|
  1. | Company A       EPS   01jan2010      1 |
  2. | Company A       GPS   01jan2010      1 |
  3. | Company A       GPS   01jan2010      1 |
  4. | Company A       GPS   02jan2010      0 |
     |----------------------------------------|
  5. | Company B       GPS   01jan2010      0 |
  6. | Company B       EPS   02jan2010      0 |
     |----------------------------------------|
  7. | Company C       EPS   04jan2010      1 |
  8. | Company C       GPS   04jan2010      1 |
     +----------------------------------------+
.
. drop if measure == "GPS" & both == 1
(3 observations deleted)
.
. list, sepby(company_id)
     +----------------------------------------+
     | company~d   measure        date   both |
     |----------------------------------------|
  1. | Company A       EPS   01jan2010      1 |
  2. | Company A       GPS   02jan2010      0 |
     |----------------------------------------|
  3. | Company B       GPS   01jan2010      0 |
  4. | Company B       EPS   02jan2010      0 |
     |----------------------------------------|
  5. | Company C       EPS   04jan2010      1 |
     +----------------------------------------+
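The approach above relies on "EPS" sorting before "GPS" within each company and date. A possible alternative, sketched here under the same example data, flags every company-date that has at least one EPS observation and then drops the GPS observations on those dates. The variable name has_eps is an assumption for illustration.

```stata
clear
input str10(company_id measure) month day year
"Company A" "EPS" 1 1 2010
"Company A" "GPS" 1 1 2010
"Company A" "GPS" 1 1 2010
"Company A" "GPS" 1 2 2010
"Company B" "EPS" 1 2 2010
"Company B" "GPS" 1 1 2010
"Company C" "GPS" 1 4 2010
"Company C" "EPS" 1 4 2010
end

gen date = mdy(month, day, year)
format date %td
drop month day year

* 1 if any observation for this company on this date is EPS, else 0
egen byte has_eps = max(measure == "EPS"), by(company_id date)

* Keep GPS only where no EPS was reported on the same date
drop if measure == "GPS" & has_eps

sort company_id date measure
list, sepby(company_id)
```

Because the flag is computed per group with egen, the result does not depend on how the observations are sorted beforehand.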

(Stata) Find values from earlier years with multiple observations within a year

I'm using Stata. I have a dataset of multiple firms and their banks within a given year for multiple years. Since firms often have more than one bank there's multiple observations for a firm-year. I have a variable "bank_exityear" which contains the last year a bank is in the sample. I would like to create a variable that for each firm within a given year contains the minimum of "bank_exityear" from the previous year (and for the same firm).
An example data-set is attached here:
The variable I'd like to create is the bold "want". The data starts in 2008.
What would be the best way to create this variable?
Here's a solution using rangestat (from SSC). To install it, type in Stata's Command window:
ssc install rangestat
For the problem at hand, this requires finding the minimum bank_exityear across all observations of the same firmid whose year is one less than the year of the current observation:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2105
end
rangestat (min) bank_exityear, interval(year -1 -1) by(firmid)
list
and the results:
. list, sepby(firmid)
     +-----------------------------------------------------+
     | year   firmid   bankid   bank_e~r   want   bank_e~n |
     |-----------------------------------------------------|
  1. | 2008        1        1       2008      .          . |
  2. | 2008        1        2       2015      .          . |
  3. | 2009        1        2       2015   2008       2008 |
  4. | 2009        1        3       2015   2008       2008 |
  5. | 2010        1        2       2015   2015       2015 |
  6. | 2010        1        3       2015   2105       2015 |
     +-----------------------------------------------------+
This sort of strategy might do the trick:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2105
end
tempfile min_year
preserve
collapse (min) want2 = bank_exityear, by(firmid year)
save `min_year'
restore
replace year = year - 1
merge m:1 firmid year using "`min_year'", nogen keep(master match)
replace year = year + 1
This assumes that there are no gaps in year.
Your question is a little unclear, but I believe some combination of
bysort bankid (year) : gen lag_exit = bank_exityear[_n-1]
bysort bankid : egen min_var = min(lag_exit)
should work.