I have the following dataset
A B begin_yr end_yr
asset brown 2007 2010
asset blue 2008 2008
basics caramel 2015 2015
cows dork 2004 2006
For each A-B pair, I want one row for every year from begin_yr to end_yr.
I expanded for each year:
gen x = end_yr - begin_yr
expand x +1
This gives me the following:
A B begin_yr end_yr x
asset brown 2007 2010 3
asset brown 2007 2010 3
asset brown 2007 2010 3
asset brown 2007 2010 3
asset blue 2008 2008 0
basics caramel 2015 2015 0
cows dork 2004 2006 2
cows dork 2004 2006 2
cows dork 2004 2006 2
Ultimately, I want the following dataset:
A B begin_yr end_yr x year
asset brown 2007 2010 3 2007
asset brown 2007 2010 3 2008
asset brown 2007 2010 3 2009
asset brown 2007 2010 3 2010
asset blue 2008 2008 0 2008
basics caramel 2015 2015 0 2015
cows dork 2004 2006 2 2004
cows dork 2004 2006 2 2005
cows dork 2004 2006 2 2006
This is what I have so far:
gen year = begin_yr if begin_yr!=end_yr
How do I populate the rest of the variable year?
Here's a twist building on @Pearly Spencer's code:
clear
input strL A strL B begin_yr end_yr
asset brown 2007 2010
basics caramel 2015 2015
cows dork 2004 2006
end
gen toexpand = end_yr - begin_yr + 1
expand toexpand
bysort A : gen year = begin_yr + _n - 1
list, sepby(A)
     +--------------------------------------------------------+
     |      A         B   begin_yr   end_yr   toexpand   year |
     |--------------------------------------------------------|
  1. |  asset     brown       2007     2010          4   2007 |
  2. |  asset     brown       2007     2010          4   2008 |
  3. |  asset     brown       2007     2010          4   2009 |
  4. |  asset     brown       2007     2010          4   2010 |
     |--------------------------------------------------------|
  5. | basics   caramel       2015     2015          1   2015 |
     |--------------------------------------------------------|
  6. |   cows      dork       2004     2006          3   2004 |
  7. |   cows      dork       2004     2006          3   2005 |
  8. |   cows      dork       2004     2006          3   2006 |
     +--------------------------------------------------------+
Nothing against tsset or tsfill but neither is needed for this.
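One caveat, offered as a sketch rather than as part of the answer above: the question's data also had a row asset blue 2008 2008, which the example here leaves out. With two rows sharing the same value of A, bysort A would number the years across both rows. One way around that, assuming a helper variable id (a hypothetical name, not in the original data) that identifies each original row, is:

```stata
clear
input str6 A str7 B begin_yr end_yr
asset  brown   2007 2010
asset  blue    2008 2008
basics caramel 2015 2015
cows   dork    2004 2006
end

* id is a hypothetical helper: one distinct value per original row
gen long id = _n

* one copy of each row per year covered
gen toexpand = end_yr - begin_yr + 1
expand toexpand

* copies of a row are identical, so their order within id does not matter
bysort id : gen year = begin_yr + _n - 1

list, sepby(id)
```

Grouping on id rather than on A keeps each begin_yr to end_yr spell separate even when A repeats.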
The following works for me:
clear
input strL A strL B begin_yr end_yr
asset brown 2007 2010
basics caramel 2015 2015
cows dork 2004 2006
end
generate id = _n
expand 2
clonevar year = begin_yr
bysort id: replace year = end_yr[2] if _n == _N
* drop the repeated (id, year) pair that a single-year spell produces
duplicates drop id year, force
tsset id year
tsfill
foreach var in A B begin_yr end_yr {
    bysort id (year): replace `var' = `var'[1]
}
list
+--------------------------------------------------+
| A B begin_yr end_yr id year |
|--------------------------------------------------|
1. | asset brown 2007 2010 1 2007 |
2. | asset brown 2007 2010 1 2008 |
3. | asset brown 2007 2010 1 2009 |
4. | asset brown 2007 2010 1 2010 |
5. | basics caramel 2015 2015 2 2015 |
|--------------------------------------------------|
6. | cows dork 2004 2006 3 2004 |
7. | cows dork 2004 2006 3 2005 |
8. | cows dork 2004 2006 3 2006 |
+--------------------------------------------------+
I have a dataset with the following structure:
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end
I would like to count how many distinct documents there are in each group of year-state and include zeros. In other words, I want my output like this:
+----------------------------+
| year state n_documents |
|----------------------------|
| 2006 AK 2 |
| 2007 AK 2 |
| 2008 AK 1 |
| 2009 AK 0 |
| 2006 AL 1 |
| 2007 AL 0 |
| 2008 AL 3 |
| 2009 AL 2 |
| 2006 AS 3 |
| 2007 AS 1 |
| 2008 AS 1 |
| 2009 AS 3 |
+----------------------------+
I tried to solve this problem using the tag() function of the egen command:
egen tag = tag(year state document)
egen n_documents = total(tag), by(year state)
collapse (first) n_documents, by(year state)
sort state year
list, sep(0) abb(20)
However, I end up with the following dataset (without zeros):
+----------------------------+
| year state n_documents |
|----------------------------|
| 2006 AK 2 |
| 2007 AK 2 |
| 2008 AK 1 |
| 2006 AL 1 |
| 2008 AL 3 |
| 2009 AL 2 |
| 2006 AS 3 |
| 2007 AS 1 |
| 2008 AS 1 |
| 2009 AS 3 |
+----------------------------+
Of course, I could manually add the remaining year-state combinations that have no documents, but my real dataset has almost one million observations, so a manual solution is not practical. Is there a way to do this in Stata?
Here's another way to do it.
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end
contract year state document, zero freq(distinct)
replace distinct = distinct > 0
collapse (sum) distinct, by(state year)
list , sepby(state)
+-------------------------+
| year state distinct |
|-------------------------|
1. | 2006 AK 2 |
2. | 2007 AK 2 |
3. | 2008 AK 1 |
4. | 2009 AK 0 |
|-------------------------|
5. | 2006 AL 1 |
6. | 2007 AL 0 |
7. | 2008 AL 3 |
8. | 2009 AL 2 |
|-------------------------|
9. | 2006 AS 3 |
10. | 2007 AS 1 |
11. | 2008 AS 1 |
12. | 2009 AS 3 |
+-------------------------+
EDIT: @Romalpa Akzo pointed to this more direct solution:
contract state year document, nomiss
contract state year, freq(n_document) zero
Thanks for the data example and clear description. One trick to do this is to reshape wide and back to long, then replace missings with 0.
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end
egen tag = tag(year state document)
collapse (sum) n_documents=tag, by(state year)
reshape wide n_documents, i(state) j(year)
reshape long
mvencode n_documents, mv(0)
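A variation on the same idea, as a sketch only: instead of the reshape round trip, fillin can add the missing state-year combinations directly. This is a swapped-in command, not what the answer above uses. It marks the rows it adds with _fillin == 1 and leaves their other variables missing.

```stata
clear
input year str2 state str11 document
2009 AS 09420849920
2006 AS 91444492147
2008 AS 91444492147
2007 AK 47080474742
2006 AK 90190072284
2007 AK 90190072284
2006 AK 10744281448
2009 AL 22408712220
2006 AS 92974278888
2008 AL 27189228210
2009 AS 92974278888
2009 AS 22408712220
2009 AL 92974278888
2006 AS 27189228210
2007 AS 91444492147
2006 AL 27189228210
2008 AL 47080474742
2008 AL 10744281448
2008 AK 09420849920
2008 AL 47080474742
end

* count each distinct year-state-document combination once
egen tag = tag(year state document)
collapse (sum) n_documents = tag, by(state year)

* add every state-year combination that is absent, then set its count to 0
fillin state year
replace n_documents = 0 if _fillin
drop _fillin

sort state year
list, sepby(state)
```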
Say I have a biennial panel with observations only in odd years, such as
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
and I would like to fill in the missing even years. Here years 2012 and 2014 are missing for all ids.
input id year var
1 2011 23
1 2012 .
1 2013 12
1 2014 .
1 2015 11
2 2011 44
2 2012 .
2 2013 42
2 2014 .
2 2015 13
end
I had a look at help expand but I am unsure that's what I need, since it does not take the by prefix.
As background, I need to fill in the even years to be able to merge with another panel dataset conducted in even years only.
You can set the panel id as id and the time variable as year and use tsfill:
clear
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
xtset id year
tsfill
If the minimum and maximum years are not the same across panels, you could look at the full option of tsfill.
. list
     +-----------------+
     | id   year   var |
     |-----------------|
  1. |  1   2011    23 |
  2. |  1   2012     . |
  3. |  1   2013    12 |
  4. |  1   2014     . |
  5. |  1   2015    11 |
     |-----------------|
  6. |  2   2011    44 |
  7. |  2   2012     . |
  8. |  2   2013    42 |
  9. |  2   2014     . |
 10. |  2   2015    13 |
     +-----------------+
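To illustrate the full option mentioned above, here is a small sketch with made-up data: the thread's example with the 2011 row for id 2 removed, so the two panels start in different years. Plain tsfill would only fill between each panel's own first and last year. With full, every panel is filled over the whole range of years in the data.

```stata
clear
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2013 42
2 2015 13
end

xtset id year

* full: fill each panel from the overall first year to the overall last year
tsfill, full

list, sepby(id)
```

After this, id 2 also has a row for 2011 (with var missing), which plain tsfill would not create.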
I have city-year panel data that look like this:
city year mayor growth
Orange 2001 A 9.599998
Orange 2002 A 14.9
Orange 2003 A 14.6
Orange 2004 A 13
Orange 2005 B 9
Orange 2006 B 12.7
Orange 2007 C 18.4
Orange 2008 D 20.7
Orange 2009 D 16.5
I want to calculate for each mayor in each city:
(1) the growth in his first year as the mayor
(2) the rolling average growth since his first year
My code is:
bysort city mayor (year) : gen rollavg = sum(growth) / sum(growth < .) if mayor != "" & growth != .
bysort city mayor (year) : gen year1growth = growth[1]
It works for most of the data, but for some city/mayor Stata returns random numbers:
city year mayor growth year1growth rollavg
Orange 2001 A 9.599998 10345.59 10345.59
Orange 2002 A 14.9 10345.59 5180.245
Orange 2003 A 14.6 10345.59 3458.363
Orange 2004 A 13 10345.59 .
Orange 2005 B 9 9 9
Orange 2006 B 12.7 9 10.85
Orange 2007 C 18.4 18.4 18.4
Orange 2008 D 20.7 20.7 20.7
Orange 2009 D 16.5 20.7 18.6
Orange 2010 D 12.5 20.7 16.56667
For example, it works for mayor D: Year1Growth = 20.7 which is the growth rate in his first year 2008. Rolling average also works, 18.6 = (20.7+16.5)/2 and 16.56 = (20.7+16.5+12.5)/3.
However, the numbers are totally wrong for mayor A.
Does anyone know how to fix this?
I can't reproduce this. I note that your example is inconsistent as your variable names for your data example and for your results are not identical. (Corrected in an edit to the original question: my data example and results are consistent in names, but different.)
My only guesses are that:
(1) There is some confusion about what is in what variable in your real dataset.
(2) Although you are showing us what look like numeric variables, some or all of those variables may have been produced by an encode of an original string variable that contained numbers. That is a known way to produce complete garbage. The advice is always to use destring, not encode.
The if qualifier is irrelevant to this example.
* Example generated by -dataex-. For more info, type help dataex
clear
input str6 City int Year str1 Mayor double Growth
"Orange" 2001 "A" 9.599998
"Orange" 2002 "A" 14.9
"Orange" 2003 "A" 14.6
"Orange" 2004 "A" 13
"Orange" 2005 "B" 9
"Orange" 2006 "B" 12.7
"Orange" 2007 "C" 18.4
"Orange" 2008 "D" 20.7
"Orange" 2009 "D" 16.5
end
bysort City Mayor (Year) : gen rollavg = sum(Growth) / sum(Growth < .)
bysort City Mayor (Year) : gen year1growth = Growth[1]
list, sepby(Mayor)
+--------------------------------------------------------+
| City Year Mayor Growth rollavg year1g~h |
|--------------------------------------------------------|
1. | Orange 2001 A 9.599998 9.599998 9.599998 |
2. | Orange 2002 A 14.9 12.25 9.599998 |
3. | Orange 2003 A 14.6 13.03333 9.599998 |
4. | Orange 2004 A 13 13.025 9.599998 |
|--------------------------------------------------------|
5. | Orange 2005 B 9 9 9 |
6. | Orange 2006 B 12.7 10.85 9 |
|--------------------------------------------------------|
7. | Orange 2007 C 18.4 18.4 18.4 |
|--------------------------------------------------------|
8. | Orange 2008 D 20.7 20.7 20.7 |
9. | Orange 2009 D 16.5 18.6 20.7 |
+--------------------------------------------------------+
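To make the encode guess concrete, here is a minimal sketch with made-up data, assuming the growth values had arrived as a string variable (growth_str, growth_bad and growth_ok are hypothetical names). encode numbers the distinct strings 1, 2, 3, ... in alphabetical order, so the result has nothing to do with the numeric content, whereas destring converts the text itself.

```stata
clear
input str8 growth_str
"9.599998"
"14.9"
"14.6"
"13"
end

* encode: integer codes in alphabetical order of the strings, with value labels
encode growth_str, gen(growth_bad)

* destring: the numbers the strings actually spell out
destring growth_str, gen(growth_ok)

* nolabel shows the underlying codes rather than the value labels
list, nolabel
```

Here "9.599998" sorts last alphabetically, so growth_bad is 4 in the first row while growth_ok is 9.599998.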
I have a panel dataset from 2006 to 2012. I generated a new variable, entry, that takes the value 1 in the year a firm enters a country. For instance, if a firm has a missing value (.) for its sales at time t, entry is 0; if at t+1 it enters the country (in other words, it has a value for its sales), entry is 1. The command I used successfully for this is:
egen firm_id=group(firm country)
by firm_id (year), sort: gen byte entry = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since my data start from 2006 I excluded the observations for this year with the command:
bysort firm (year) : replace entry = 0 if year == 2006
However, instead of 0 values, I want missing values for the years after entry (e.g. at t+2 or t+3).
I applied the same logic for exit, but reversed the sort order of year:
gen nyear = -year
by firm_id (nyear), sort: gen byte exit = ///
sum(inrange(sales, 0,.)) == 1 & sum(inrange(sales[_n - 1],0,.)) == 0
Since the last observation year in my data is 2012, I excluded those observations:
bysort firm (year) : replace exit = 0 if year == 2012
Again, instead of 0 values, I want missing values for the years after exit (e.g. at t+2 or t+3).
As I understand it, the variable sales is missing when there are none and positive otherwise.
You want indicators for a year being the first and last years of sales for a firm in a country.
I think this gets you most of the way. First, we need example data!
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(firm_id year sales)
1 2006 .
1 2007 .
1 2008 42
1 2009 42
1 2010 42
1 2011 .
1 2012 .
2 2006 .
2 2007 666
2 2008 666
2 2009 .
2 2010 .
2 2011 .
2 2012 .
end
The first and last dates are the minimum and maximum dates, conditional on there being sales.
egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last = max(cond(sales < ., year, .)), by(firm_id)
For discussion of technique, see section 9 of this paper. Then (1, .) indicators follow directly
generate isfirst = cond(year == first, 1, .)
generate islast = cond(year == last, 1, .)
list, sepby(firm_id)
+----------------------------------------------------------+
| firm_id year sales first last isfirst islast |
|----------------------------------------------------------|
1. | 1 2006 . 2008 2010 . . |
2. | 1 2007 . 2008 2010 . . |
3. | 1 2008 42 2008 2010 1 . |
4. | 1 2009 42 2008 2010 . . |
5. | 1 2010 42 2008 2010 . 1 |
6. | 1 2011 . 2008 2010 . . |
7. | 1 2012 . 2008 2010 . . |
|----------------------------------------------------------|
8. | 2 2006 . 2007 2008 . . |
9. | 2 2007 666 2007 2008 1 . |
10. | 2 2008 666 2007 2008 . 1 |
11. | 2 2009 . 2007 2008 . . |
12. | 2 2010 . 2007 2008 . . |
13. | 2 2011 . 2007 2008 . . |
14. | 2 2012 . 2007 2008 . . |
+----------------------------------------------------------+
I haven't done anything different for 2006 or 2012. You could just build special rules into the cond() syntax.
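As a sketch of what such a special rule might look like, assume (this is my reading of the question, not something it states outright) that a first year of 2006 or a last year of 2012 should not be flagged, because the sample window hides what happened before and after. The data below are made up for the illustration.

```stata
clear
input float(firm_id year sales)
1 2006 5
1 2007 5
1 2008 .
2 2010 .
2 2011 7
2 2012 7
end

egen first = min(cond(sales < ., year, .)), by(firm_id)
egen last  = max(cond(sales < ., year, .)), by(firm_id)

* assumed rule: do not flag entry in 2006 or exit in 2012
generate isfirst = cond(year == first & first > 2006, 1, .)
generate islast  = cond(year == last  & last  < 2012, 1, .)

list, sepby(firm_id)
```

Firm 1 is already selling in 2006, so it gets no entry flag but is flagged as exiting in 2007. Firm 2 is flagged as entering in 2011 and, still selling in 2012, gets no exit flag.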
I'm using Stata. I have a dataset of firms and their banks within a given year, for multiple years. Since firms often have more than one bank, there are multiple observations per firm-year. I have a variable "bank_exityear" which contains the last year a bank is in the sample. I would like to create a variable that, for each firm within a given year, contains the minimum of "bank_exityear" from the previous year (for the same firm).
An example data-set is attached here:
The variable I'd like to create is the bold "want". The data starts in 2008.
What would be the best way to create this variable?
Here's a solution using rangestat (from SSC). To install it, type in Stata's Command window:
ssc install rangestat
For the problem at hand, this requires finding the minimum bank_exityear across all observations of the same firmid whose year is one less than the year of the current observation:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2015
end
rangestat (min) bank_exityear, interval(year -1 -1) by(firmid)
list, sepby(firmid)
and the results:
. list, sepby(firmid)
+-----------------------------------------------------+
| year firmid bankid bank_e~r want bank_e~n |
|-----------------------------------------------------|
1. | 2008 1 1 2008 . . |
2. | 2008 1 2 2015 . . |
3. | 2009 1 2 2015 2008 2008 |
4. | 2009 1 3 2015 2008 2008 |
5. | 2010 1 2 2015 2015 2015 |
6. | 2010 1 3 2015 2015 2015 |
+-----------------------------------------------------+
This sort of strategy might do the trick:
clear
input year firmid bankid bank_exityear want
2008 1 1 2008 .
2008 1 2 2015 .
2009 1 2 2015 2008
2009 1 3 2015 2008
2010 1 2 2015 2015
2010 1 3 2015 2015
end
tempfile min_year
preserve
collapse (min) want2 = bank_exityear, by(firmid year)
save `min_year'
restore
replace year = year - 1
merge m:1 firmid year using "`min_year'", nogen keep(master match)
replace year = year + 1
This assumes that there are no gaps in year.
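For comparison, here is a sketch that needs neither a community-contributed command nor a merge. It uses by-group subscripting on the same example data; minexit and want3 are hypothetical names. If the previous observed year for a firm is not exactly year - 1, the result is left missing.

```stata
clear
input year firmid bankid bank_exityear
2008 1 1 2008
2008 1 2 2015
2009 1 2 2015
2009 1 3 2015
2010 1 2 2015
2010 1 3 2015
end

* minimum exit year within each firm-year
bysort firmid year : egen minexit = min(bank_exityear)

* on the first row of each firm-year, look back one row to the previous year's minimum
bysort firmid (year bankid) : gen want3 = minexit[_n-1] if year == year[_n-1] + 1

* missing sorts last, so want3[1] is the filled-in value; spread it over the firm-year
bysort firmid year (want3) : replace want3 = want3[1]

sort firmid year bankid
list, sepby(firmid year)
```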
Your question is a little unclear, but I believe some combination of
bysort bank_id (year) : gen lag_exit = bank_exityear[_n-1]
bysort bank_id : egen min_var = min(lag_exit )
should work