Stata groupby average and first observation give wrong number

I have a city-year level panel data that looks like:
city year mayor growth
Orange 2001 A 9.599998
Orange 2002 A 14.9
Orange 2003 A 14.6
Orange 2004 A 13
Orange 2005 B 9
Orange 2006 B 12.7
Orange 2007 C 18.4
Orange 2008 D 20.7
Orange 2009 D 16.5
I want to calculate for each mayor in each city:
(1) the growth in his first year as the mayor
(2) the rolling average growth since his first year
My code is:
bysort city mayor (year) : gen rollavg = sum(growth) / sum(growth < .) if mayor != "" & growth != .
bysort city mayor (year) : gen year1growth = growth[1]
It works for most of the data, but for some city/mayor combinations Stata returns seemingly random numbers:
city year mayor growth year1growth rollavg
Orange 2001 A 9.599998 10345.59 10345.59
Orange 2002 A 14.9 10345.59 5180.245
Orange 2003 A 14.6 10345.59 3458.363
Orange 2004 A 13 10345.59 .
Orange 2005 B 9 9 9
Orange 2006 B 12.7 9 10.85
Orange 2007 C 18.4 18.4 18.4
Orange 2008 D 20.7 20.7 20.7
Orange 2009 D 16.5 20.7 18.6
Orange 2010 D 12.5 20.7 16.56667
For example, it works for mayor D: Year1Growth = 20.7 which is the growth rate in his first year 2008. Rolling average also works, 18.6 = (20.7+16.5)/2 and 16.56 = (20.7+16.5+12.5)/3.
However, the numbers are totally wrong for mayor A.
Does anyone know how to fix this?

I can't reproduce this. I note that your example is inconsistent: the variable names in your data example and in your results are not identical. (That has since been corrected in an edit to the question; my own data example and results below use consistent names, though different ones.)
My only guesses are that:
There is some confusion about what is in which variable in your real dataset.
Although you are showing us what look like numeric variables, some or all of them may have been produced by encode of an original string variable that contained numbers. That is a known way to produce complete garbage; see the sketch after the listing below. The advice is always to use destring, not encode.
The if qualifier is irrelevant to this example.
* Example generated by -dataex-. For more info, type help dataex
clear
input str6 City int Year str1 Mayor double Growth
"Orange" 2001 "A" 9.599998
"Orange" 2002 "A" 14.9
"Orange" 2003 "A" 14.6
"Orange" 2004 "A" 13
"Orange" 2005 "B" 9
"Orange" 2006 "B" 12.7
"Orange" 2007 "C" 18.4
"Orange" 2008 "D" 20.7
"Orange" 2009 "D" 16.5
end
bysort City Mayor (Year) : gen rollavg = sum(Growth) / sum(Growth < .)
bysort City Mayor (Year) : gen year1growth = Growth[1]
list, sepby(Mayor)
+--------------------------------------------------------+
| City Year Mayor Growth rollavg year1g~h |
|--------------------------------------------------------|
1. | Orange 2001 A 9.599998 9.599998 9.599998 |
2. | Orange 2002 A 14.9 12.25 9.599998 |
3. | Orange 2003 A 14.6 13.03333 9.599998 |
4. | Orange 2004 A 13 13.025 9.599998 |
|--------------------------------------------------------|
5. | Orange 2005 B 9 9 9 |
6. | Orange 2006 B 12.7 10.85 9 |
|--------------------------------------------------------|
7. | Orange 2007 C 18.4 18.4 18.4 |
|--------------------------------------------------------|
8. | Orange 2008 D 20.7 20.7 20.7 |
9. | Orange 2009 D 16.5 18.6 20.7 |
+--------------------------------------------------------+
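To illustrate the encode pitfall mentioned above, here is a minimal sketch (illustrative data only; s_growth is a made-up name): encode stores arbitrary integer codes masked by value labels showing the original text, so any arithmetic uses the codes rather than the numbers you see displayed, whereas destring recovers the true values.
* Minimal sketch of the encode pitfall (not the poster's data)
clear
input str8 s_growth
"9.599998"
"14.9"
"14.6"
end
encode s_growth, gen(bad)      // integer codes 1, 2, 3 hidden behind value labels
destring s_growth, gen(good)   // the actual numeric values
list, nolabel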

Related

Filter Specific Data in Stata

I'm using Stata 13 and need to clean a dataset in panel format, with different ids observed over the period 2000 to 2003. My data looks like:
id year ln_wage
1 2000 2.30
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.10
3 2002 1.60
4 2002 2.46
4 2003 2.47
5 2000 2.10
5 2001 2.10
5 2003 2.12
For each year, I would like to keep only the individuals that also appear in year t-1. In this way, the first year of my sample (2000) will be dropped. I'm looking for output like:
2001:
id year ln_wage
1 2001 2.31
5 2001 2.10
2002:
id year ln_wage
1 2002 2.31
2 2002 1.89
2003:
id year ln_wage
2 2003 2.10
4 2003 2.47
Regards,
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int year float ln_wage
1 2000 2.3
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.1
3 2002 1.6
4 2002 2.46
4 2003 2.47
5 2000 2.1
5 2001 2.1
5 2003 2.12
end
xtset id year
drop if missing(L.ln_wage)
sort year id
list, noobs sepby(year)
+---------------------+
| id year ln_wage |
|---------------------|
| 1 2001 2.31 |
| 5 2001 2.1 |
|---------------------|
| 1 2002 2.31 |
| 2 2002 1.89 |
|---------------------|
| 2 2003 2.1 |
| 4 2003 2.47 |
+---------------------+
// Alternatively, assuming no duplicate years within id exist
// (the first observation of each id is also flagged, because year[_n-1]
// is missing there and the inequality evaluates to true)
bysort id (year): gen todrop = year[_n-1] != year - 1
drop if todrop
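As a quick check with the first approach, one can flag the affected observations instead of dropping them straight away; a sketch (no_prior is just an illustrative name):
* Sketch: flag observations with no record for the same id in the previous year
xtset id year
gen byte no_prior = missing(L.ln_wage)
list id year ln_wage no_prior, noobs sepby(id)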

Generating values for columns between a range

I have the following dataset
A B begin_yr end_yr
asset brown 2007 2010
asset blue 2008 2008
basics caramel 2015 2015
cows dork 2004 2006
I want each A-B combination to have a row for every year from begin_yr through end_yr.
I expanded to get one row per year:
gen x = end_yr - begin_yr
expand x +1
This gives me the following:
A B begin_yr end_yr x
asset brown 2007 2010 3
asset brown 2007 2010 3
asset brown 2007 2010 3
asset brown 2007 2010 3
asset blue 2008 2008 0
basics caramel 2015 2015 0
cows dork 2004 2006 2
Ultimately, I want the following dataset:
A B begin_yr end_yr x year
asset brown 2007 2010 3 2007
asset brown 2007 2010 3 2008
asset brown 2007 2010 3 2009
asset brown 2007 2010 3 2010
asset blue 2008 2008 0 2008
basics caramel 2015 2015 0 2015
cows dork 2004 2006 2 2004
cows dork 2004 2006 2 2005
cows dork 2004 2006 2 2006
This is what I have so far:
gen year = begin_yr if begin_yr!=end_yr
How do I populate the rest of the variable year?
Here's a twist building on @Pearly Spencer's code:
clear
input strL A strL B begin_yr end_yr
asset brown 2007 2010
basics caramel 2015 2015
cows dork 2004 2006
end
gen toexpand = end - begin + 1
expand toexpand
bysort A : gen year = begin + _n - 1
list, sepby(A)
+--------------------------------------------------------+
| A B begin_yr end_yr toexpand year |
|--------------------------------------------------------|
1. | asset brown 2007 2010 4 2007 |
2. | asset brown 2007 2010 4 2008 |
3. | asset brown 2007 2010 4 2009 |
4. | asset brown 2007 2010 4 2010 |
|--------------------------------------------------------|
5. | basics caramel 2015 2015 1 2015 |
|--------------------------------------------------------|
6. | cows dork 2004 2006 3 2004 |
7. | cows dork 2004 2006 3 2005 |
8. | cows dork 2004 2006 3 2006 |
+--------------------------------------------------------+
Nothing against tsset or tsfill but neither is needed for this.
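As a further sketch, the same idea grouping on both A and B also covers the asset/blue row from the question, which the data example above omits:
clear
input strL A strL B begin_yr end_yr
asset brown 2007 2010
asset blue 2008 2008
basics caramel 2015 2015
cows dork 2004 2006
end
gen toexpand = end_yr - begin_yr + 1
expand toexpand
bysort A B (begin_yr) : gen year = begin_yr + _n - 1
list, sepby(A B)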
The following works for me:
clear
input strL A strL B begin_yr end_yr
asset brown 2007 2010
basics caramel 2015 2015
cows dork 2004 2006
end
generate id = _n
expand 2                                         // duplicate every observation
clonevar year = begin_yr
bysort id: replace year = end_yr[2] if _n == _N  // last copy within each id takes the end year
* Observation 3 is one of the two identical 2015 rows for basics/caramel
* (begin_yr == end_yr); drop it so tsset does not fail on repeated time values
drop if _n == 3
tsset id year
tsfill                                           // create the missing in-between years
foreach var in A B begin_yr end_yr {
    bysort id: replace `var' = `var'[1]          // fill the rows tsfill created from each id's first row
}
list
+--------------------------------------------------+
| A B begin_yr end_yr id year |
|--------------------------------------------------|
1. | asset brown 2007 2010 1 2007 |
2. | asset brown 2007 2010 1 2008 |
3. | asset brown 2007 2010 1 2009 |
4. | asset brown 2007 2010 1 2010 |
5. | basics caramel 2015 2015 2 2015 |
|--------------------------------------------------|
6. | cows dork 2004 2006 3 2004 |
7. | cows dork 2004 2006 3 2005 |
8. | cows dork 2004 2006 3 2006 |
+--------------------------------------------------+

Variable showing the highest value of another variable attained so far over time

I have a dataset of patients and their alcohol-related data over time (in years), like below:
clear
input long patid float(year cohort)
1051 1994 1
2051 1972 1
2051 1989 2
2051 1990 2
2051 2000 2
2051 2001 3
2051 2002 1
2051 2003 2
8051 1995 1
8051 1996 1
8051 2003 1
end
label values cohort cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "non-drinker" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
I would like to create a variable that shows the highest alcohol code used so far at any point (year) in a patient's record, such that the dataset would look like below:
clear
input long patid float(year cohort highestsofar)
1051 1994 1 1
2051 1972 1 1
2051 1989 2 2
2051 1990 2 2
2051 2000 2 2
2051 2001 3 3
2051 2002 1 3
2051 2003 2 3
8051 1995 1 1
8051 1996 1 1
8051 2003 1 1
end
label values cohort cohortlab
label values highestsofar cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "lifetime_abstainer" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
Thanks for the clear example and question.
The problem is already covered by an FAQ on the StataCorp website. Here's a one-line solution using rangestat from SSC.
clear
input long patid float(year cohort)
1051 1994 1
2051 1972 1
2051 1989 2
2051 1990 2
2051 2000 2
2051 2001 3
2051 2002 1
2051 2003 2
8051 1995 1
8051 1996 1
8051 2003 1
end
label values cohort cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "non-drinker" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
rangestat (max) highestsofar = cohort, interval(year . 0) by(patid)
list, sepby(patid)
+-------------------------------------------+
| patid year cohort highes~r |
|-------------------------------------------|
1. | 1051 1994 no alcohol data 1 |
|-------------------------------------------|
2. | 2051 1972 no alcohol data 1 |
3. | 2051 1989 indeterminate 2 |
4. | 2051 1990 indeterminate 2 |
5. | 2051 2000 indeterminate 2 |
6. | 2051 2001 non-drinker 3 |
7. | 2051 2002 no alcohol data 3 |
8. | 2051 2003 indeterminate 3 |
|-------------------------------------------|
9. | 8051 1995 no alcohol data 1 |
10. | 8051 1996 no alcohol data 1 |
11. | 8051 2003 no alcohol data 1 |
+-------------------------------------------+
I would like to offer an answer:
bysort patid (year): gen highestsofar = cohort if cohort > cohort[_n-1] | _n == 1
by patid: replace highestsofar = highestsofar[_n-1] if cohort <= cohort[_n-1] & _n > 1
by patid: replace highestsofar = highestsofar[_n-1] if highestsofar < highestsofar[_n-1] & cohort > cohort[_n-1] & _n > 1
label values highestsofar cohortlab
I would be happy if a more compact syntax could be discussed.
Thanks
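For what it's worth, a more compact variant is the standard running-maximum idiom with replace and max(); a sketch, where highestsofar2 is just an illustrative name:
* Sketch: running maximum of cohort within patient, carried forward row by row
gen highestsofar2 = cohort
bysort patid (year): replace highestsofar2 = max(highestsofar2[_n-1], cohort) if _n > 1
label values highestsofar2 cohortlab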

Reshaping when year and countries are both columns

I am trying to reshape some data. The issue is that data are usually either long or wide, but this seems to be set up in a way that I cannot figure out how to reshape. The data looks as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 15
2000 canada 16
The problem is just in the variable names. See e.g. this FAQ for the advice that you may need rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) gdppc=
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
+--------------------------+
| year country gdppc |
|--------------------------|
1. | 1999 australia 10 |
2. | 2000 australia 12 |
3. | 2001 australia 14 |
|--------------------------|
4. | 1999 canada 15 |
5. | 2000 canada 16 |
6. | 2001 canada 18 |
|--------------------------|
7. | 1999 denmark 20 |
8. | 2000 denmark 25 |
9. | 2001 denmark 40 |
+--------------------------+

Adding observations with specific values for variable

First, have a look at some variables of my dataset:
firm_id year dyrstr Lack total_workers
2432 2002 1980 29
2432 2003 1980 23
2432 2005 1980 1 283
2432 2006 1980 56
2432 2007 1980 21
2433 2004 2001 42
2433 2006 2001 1 29
2433 2008 2001 1 100
2434 2002 2002 21
2434 2003 2002 55
2434 2004 2002 22
2434 2005 2002 24
2434 2006 2002 17
2434 2007 2002 40
2434 2008 2002 110
2434 2009 2002 158
2434 2010 2002 38
2435 2002 2002 80
2435 2003 2002 86
2435 2004 2002 877
2435 2005 2002 254
2435 2006 2002 71
2435 2007 2002 116
2435 2008 2002 118
2435 2009 2002 1165
2435 2010 2002 67
2436 2002 1992 24
2436 2003 1992 25
2436 2004 1992 22
2436 2005 1992 23
2436 2006 1992 21
2436 2007 1992 100
2436 2008 1992 73
2436 2009 1992 23
2436 2010 1992 40
2437 2002 2002 30
2437 2003 2002 31
2437 2004 2002 21
2437 2006 2002 1 56
2437 2007 2002 20
The variables:
firm_id is an identifier for firms
year is the year of the observation
dyrstr is the founding year of a firm
Lack equals 1 if there is a missing observation in the year before (e.g. in line three of the dataset, Lack equals 1 because for the firm with ID 2432, there is no observation in the year 2004)
total_workers is the number of workers
I'd like to fill in the gaps, that is, create new observations as shown in the following (considering only the firm with ID 2432):
firm_id year dyrstr Lack total_workers
2432 2002 1980 29
2432 2003 1980 23
*2432* *2004* *1980* *153*
2432 2005 1980 1 283
2432 2006 1980 56
2432 2007 1980 21
The line where I've put the values of the variables in asterisks is the newly created observation. This observation should be a copy of the previous observation, but with some modifications:
firm_id should stay the same as in the line before
year should be the year from the previous line plus one
dyrstr should stay the same as in the line before
Lack: here it doesn't matter which value this variable has
total_workers equals 0.5*(value of the previous observation + value of consecutive observation)
all other variables of my dataset (which I didn't list here) should stay the same as in the line before
I read something about the command expand but help expand doesn't help me much. Hopefully one of you can help me!
My suggestions hinge on using expand, which in turn just requires information on the number of observations to be added. I ignore your variable Lack, as Stata itself can work out where the gaps are. My procedure for imputing total_workers is based on using the inbuilt command ipolate and thus would work over gaps longer than 1 year, which don't appear in your example. The number of workers so estimated is not necessarily an integer.
For other interpolation procedures, check out cipolate, csipolate, pchipolate, all accessible via ssc desc cipolate (or equivalent).
This kind of operation depends on getting sort order exactly right, which I don't think is trivial, even with experience, so in getting the code right for similar problems, be prepared for false starts; pepper your trial code with list statements; and work on a good toy example dataset (as you kindly provided here).
. clear
. input firm_id year dyrstr total_workers
firm_id year dyrstr total_w~s
1. 2432 2002 1980 29
2. 2432 2003 1980 23
3. 2432 2005 1980 283
4. 2432 2006 1980 56
5. 2432 2007 1980 21
6. 2433 2004 2001 42
7. 2433 2006 2001 29
8. 2433 2008 2001 100
9. 2434 2002 2002 21
10. 2434 2003 2002 55
11. 2434 2004 2002 22
12. 2434 2005 2002 24
13. 2434 2006 2002 17
14. 2434 2007 2002 40
15. 2434 2008 2002 110
16. 2434 2009 2002 158
17. 2434 2010 2002 38
18. 2435 2002 2002 80
19. 2435 2003 2002 86
20. 2435 2004 2002 877
21. 2435 2005 2002 254
22. 2435 2006 2002 71
23. 2435 2007 2002 116
24. 2435 2008 2002 118
25. 2435 2009 2002 1165
26. 2435 2010 2002 67
27. 2436 2002 1992 24
28. 2436 2003 1992 25
29. 2436 2004 1992 22
30. 2436 2005 1992 23
31. 2436 2006 1992 21
32. 2436 2007 1992 100
33. 2436 2008 1992 73
34. 2436 2009 1992 23
35. 2436 2010 1992 40
36. 2437 2002 2002 30
37. 2437 2003 2002 31
38. 2437 2004 2002 21
39. 2437 2006 2002 56
40. 2437 2007 2002 20
41. end
. scalar N = _N
. bysort firm_id (year) : gen gap = year - year[_n-1]
(6 missing values generated)
. expand gap
(6 missing counts ignored; observations not deleted)
(4 observations created)
. gen orig = _n <= scalar(N)
. bysort firm_id (year) : replace total_workers = . if !orig
(4 real changes made, 4 to missing)
. bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1
(4 real changes made)
. bysort firm_id (year): ipolate total_workers year , gen(total_workers2)
. list, sepby(firm_id)
+------------------------------------------------------------+
| firm_id year dyrstr total_~s gap orig total_~2 |
|------------------------------------------------------------|
1. | 2432 2002 1980 29 . 1 29 |
2. | 2432 2003 1980 23 1 1 23 |
3. | 2432 2004 1980 . 2 0 153 |
4. | 2432 2005 1980 283 2 1 283 |
5. | 2432 2006 1980 56 1 1 56 |
6. | 2432 2007 1980 21 1 1 21 |
|------------------------------------------------------------|
7. | 2433 2004 2001 42 . 1 42 |
8. | 2433 2005 2001 . 2 0 35.5 |
9. | 2433 2006 2001 29 2 1 29 |
10. | 2433 2007 2001 . 2 0 64.5 |
11. | 2433 2008 2001 100 2 1 100 |
|------------------------------------------------------------|
12. | 2434 2002 2002 21 . 1 21 |
13. | 2434 2003 2002 55 1 1 55 |
14. | 2434 2004 2002 22 1 1 22 |
15. | 2434 2005 2002 24 1 1 24 |
16. | 2434 2006 2002 17 1 1 17 |
17. | 2434 2007 2002 40 1 1 40 |
18. | 2434 2008 2002 110 1 1 110 |
19. | 2434 2009 2002 158 1 1 158 |
20. | 2434 2010 2002 38 1 1 38 |
|------------------------------------------------------------|
21. | 2435 2002 2002 80 . 1 80 |
22. | 2435 2003 2002 86 1 1 86 |
23. | 2435 2004 2002 877 1 1 877 |
24. | 2435 2005 2002 254 1 1 254 |
25. | 2435 2006 2002 71 1 1 71 |
26. | 2435 2007 2002 116 1 1 116 |
27. | 2435 2008 2002 118 1 1 118 |
28. | 2435 2009 2002 1165 1 1 1165 |
29. | 2435 2010 2002 67 1 1 67 |
|------------------------------------------------------------|
30. | 2436 2002 1992 24 . 1 24 |
31. | 2436 2003 1992 25 1 1 25 |
32. | 2436 2004 1992 22 1 1 22 |
33. | 2436 2005 1992 23 1 1 23 |
34. | 2436 2006 1992 21 1 1 21 |
35. | 2436 2007 1992 100 1 1 100 |
36. | 2436 2008 1992 73 1 1 73 |
37. | 2436 2009 1992 23 1 1 23 |
38. | 2436 2010 1992 40 1 1 40 |
|------------------------------------------------------------|
39. | 2437 2002 2002 30 . 1 30 |
40. | 2437 2003 2002 31 1 1 31 |
41. | 2437 2004 2002 21 1 1 21 |
42. | 2437 2005 2002 . 2 0 38.5 |
43. | 2437 2006 2002 56 2 1 56 |
44. | 2437 2007 2002 20 1 1 20 |
+------------------------------------------------------------+
The following works if, as in your example dataset, you don't have consecutive years missing for any given firm. I also assume that the variable Lack is numeric and that the final result is an unbalanced panel (you were not specific about this point in your question).
* Expand database
expand 2 if Lack == 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year - 1 if x == 1
replace total_workers = (total_workers[_n-1] + total_workers[_n+1])/2 if x == 1
list, sepby(firm_id)
The expand line could be re-written as expand Lack + 1, gen(x), but perhaps it is clearer as written.
For the more general case in which you do have consecutive years missing, the following should get you started under the assumption that Lack specifies the number of consecutive years missing. For example, if there is a jump from 2006 to 2009 for a given firm, then Lack = 2 for the 2009 observation.
* Expand database
expand Lack + 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year[_n-1] + 1 if x == 1
Now you just need to come up with an imputation rule for your total_workers:
replace total_workers = ...
If Lack is a string, convert it to numeric using real().
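One possible imputation rule, echoing the ipolate-based answer above (a sketch; total_workers_ip is an illustrative name):
* Blank out the values copied onto the new rows, then interpolate linearly
* over the now-complete year grid within each firm
replace total_workers = . if x == 1
bysort firm_id (year): ipolate total_workers year, gen(total_workers_ip)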
You've already accepted an answer, but I have had to do something similar before and always use the cross command, as follows. Assuming your dataset is already in memory, continue with the following code:
tempfile master year
save `master'
preserve
keep year
duplicates drop
save `year'
restore
//next two lines set me up to correct for different year ranges by firm; if year ranges were standard, this would be omitted
bys firm_id: egen minyear=min(year)
bys firm_id: egen maxyear=max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge m:1 firm_id year using `master', assert(1 3) nogen
drop if year<minyear | year>maxyear //this adjusts for years outside the earliest and latest years observed by firm; if year ranges standard, again omitted
Then from here, use the ipolate command in the spirit of @NickCox.
I'm particularly interested in any pros and cons of using expand versus cross. (Beyond the fact that my approach here hinges on at least one record being observed for each year in order to construct the crossed dataset, a limitation that could be removed by creating the `year' tempfile differently.)