How to get average number of observations per group? - stata

In my dataset, I have observations for football matches. One of my variables is hometeam. Now I want to get the average number of observations per hometeam. How do I do that in Stata?
I know that I could tab hometeam, but since there are over 500 distinct hometeams, I don't want to do the calculation manually.

bysort hometeam : gen n = _N
bysort hometeam : gen tag = _n == 1
su n if tag
EDIT: Another, more concise way to do it:
bysort hometeam : gen n = _N if _n == 1
su n
Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. egen, tag() does the same thing.
Why if _n == 1? You need this value just once for each group. Two choices always work, even for groups as small as one observation: use the first observation in each group, or use the last. In a group of one they are the same observation, but that doesn't matter. So if _n == _N is another way to do it.
bysort hometeam : gen n = _N if _n == _N
The code needs to be changed if observations with missing values on some variable should not be counted:
bysort hometeam : gen n = sum(!missing(myvar))
by hometeam : replace n = . if _n < _N
egen, count() is similar, but not identical.
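As a minimal sketch of the tagging approach with egen, tag(), here is the same calculation on Stata's bundled auto data. The variable foreign stands in for hometeam; that substitution is an assumption made only so the example is self-contained.

```stata
sysuse auto, clear
* tag exactly one observation in each group (foreign stands in for hometeam)
egen tag = tag(foreign)
* group size, repeated on every observation of the group
bysort foreign : gen n = _N
* average group size, using each group only once
summarize n if tag, meanonly
display r(mean)
```

With 52 domestic and 22 foreign cars in that dataset, the displayed mean is 37.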

I assume you can identify the different hometeams with some id variable.
If you want the average number of observations per id this is one way:
clear all
set more off
input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end
list, sepby(id)
bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)
Note that egen's count() does not count observations with missing values. If you did want to count those as well, you could do:
bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)

Option 2: Using @Roberto's data:
collapse (count) hometeam, by(id)
sum hometeam,meanonly

Related

Stata using if condition with _n under by and egen commands

In Stata's auto data the following command creates all missing values: why?
bysort mpg: egen n1 = mean(price) if rep78[_n]!=rep78
For example take the 14 mpg group:
price mpg rep78
11385 14 3
14500 14 2
6303 14 4
12990 14
5379 14 4
13466 14 3
I expected that n1 for the first row will be mean(14500,6303,12990,5379). Basically I want the mean after excluding the first and last rows because for them we have rep78[_n]==rep78 (equals 3). But instead, I get all missing values.
The subscript [_n] is harmless but vacuous here as referring to the current observation. So the condition is just equivalent to rep78 != rep78 or rep78[_n] != rep78[_n] -- which is never true and so no observations satisfy the condition and the mean is returned as missing.
You're hoping or imagining that the prefix by: implies comparisons within a group, but at best that works only if subscripts are explicit and different.
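A short sketch of that point, using the same auto data. The variable name differs is hypothetical, and sorting by price within mpg is an arbitrary choice made only to fix which observation counts as the first in each group.

```stata
sysuse auto, clear
* rep78[_n] is just rep78 in the current observation, so nothing satisfies this
count if rep78[_n] != rep78
* under by:, an explicit subscript such as [1] refers to the first observation of each group
bysort mpg (price) : gen byte differs = rep78 != rep78[1]
list price mpg rep78 differs if mpg == 14
```

The count is zero, which is why egen returns missing everywhere. The second command does compare within groups, but only against one fixed observation, not against every other observation as the question requires.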
This works for your problem:
sysuse auto, clear
gen wanted = .
quietly forval i = 1/`=_N' {
su price if mpg == mpg[`i'] & rep78 != rep78[`i'], meanonly
replace wanted = r(mean) in `i'
}
There may be a way to do this with rangestat or rangerun from SSC, or otherwise, in which case a better solution may follow.
EDIT: The OP's code suggestion in comments
bysort mpg rep78: egen sum_m_r_price = sum(price)
bysort mpg rep78: egen count_m_r_price = count(price)
bysort mpg: egen sum_r_price = sum(price)
bysort mpg: egen count_r_price = count(price)
gen b_wanted = ( sum_r_price-sum_m_r_price)/ (count_r_price-count_m_r_price)
appears equivalent.
In reverse, this should be faster than that:
rangestat (sum) sum2=price (count) count2=price, i(rep78 0 0) by(mpg)
rangestat (sum) sum1=price (count) count1=price, i(mpg 0 0)
gen double wanted = (sum1 - sum2) / (count1 - count2)

Stata Generate New Variable List By Multiplying Var Lists

I have a balanced panel with a set of dummies for 'countries' and observations for several years. I want to generate a new set of variables that assigns a number in the sequence 1:n for each year observation of country i, and 0 for any other observation that is not from country i.
As an example, suppose I have two countries and two years. Below on the left is an example of my database. I want a new set of variables as shown on the right:
*   Example of Database          Example of Desired Output
*   country1  country2  year     output1  output2
*          1         0     1           1        0
*          1         0     2           2        0
*          0         1     1           0        1
*          0         1     2           0        2
How can I get the desired output? Intuitively I need to multiply 'country*' by 'year' to get 'output*', but I have been unable to make it work in Stata.
Below is what I tried.
gen output = year * country
* country is ambiguous
gen output = year * country*
* invalid syntax
foreach var in country*{
gen output_`var' = year * `var'
}
* invalid name
Your last attempt almost solved it. The issue with your attempt is that you need to tell Stata that you are passing a varlist for you to be able to use the wildcards * and ?. To be able to use a wildcard in foreach, do this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end
foreach var of varlist country* {
gen `var'_year = year * `var'
}
The full name country1, country2 etc. is stored in `var' so I took the freedom to update the name of the result variables to country1_year, country2_year etc. rather than output_country1, output_country2 etc.
Note that this solution only gives the desired output if the country* variables take only the values 1 and 0, no observation has a missing value in any country* variable, and no observation has the value 1 in more than one country* variable.
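Those conditions can be checked before running the loop. The following is a minimal sketch on the example data; the helper variable n_on is hypothetical. The checks should run before the country*_year variables are created, because afterwards the wildcard country* would match them too.

```stata
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end
* each dummy must be exactly 0 or 1 (this also rules out missing values)
foreach var of varlist country* {
    assert inlist(`var', 0, 1)
}
* at most one dummy may be switched on in each observation
egen byte n_on = rowtotal(country*)
assert n_on <= 1
drop n_on
```

If any assert fails, Stata stops with an error and reports how many observations contradict the condition.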

Computing running sum with moving time-window

My data
I am working on a spell dataset in the following format:
cls
clear all
set more off
input id spellnr str7 bdate_str str7 edate_str employed
1 1 2008m1 2008m9 1
1 2 2008m12 2009m8 0
1 3 2009m11 2010m9 1
1 4 2010m10 2011m9 0
///
2 1 2007m4 2009m12 1
2 2 2010m4 2011m4 1
2 3 2011m6 2011m8 0
end
* translate to Stata monthly dates
gen bdate = monthly(bdate_str,"YM")
gen edate = monthly(edate_str,"YM")
drop *_str
format %tm bdate edate
list, sepby(id)
Corresponding to:
+---------------------------------------------+
| id spellnr employed bdate edate |
|---------------------------------------------|
1. | 1 1 1 2008m1 2008m9 |
2. | 1 2 0 2008m12 2009m8 |
3. | 1 3 1 2009m11 2010m9 |
4. | 1 4 0 2010m10 2011m9 |
|---------------------------------------------|
5. | 2 1 1 2007m4 2009m12 |
6. | 2 2 1 2010m4 2011m4 |
7. | 2 3 0 2011m6 2011m8 |
+---------------------------------------------+
Here a given person (id) can have multiple spells (spellnr) of two types (employed: 1 for employment; 0 for unemployment). The start and end dates of each spell are defined by bdate and edate, respectively.
Imagine the data was already cleaned, and is such that no spells overlap with each other.
There might be "missing" periods in between any two spells though.
This is captured by the dummy dataset above.
My question:
For each unemployment spell, I need to compute the number of months spent in employment in the last 6 months, 24 months, and 48 months.
Note that, importantly, each id can go in and out from employment, and all past employment spells should be taken into account (not just the last one).
In my example, this would lead to the following desired output:
+--------------------------------------------------------------+
| id spellnr employed bdate edate m6 m24 m48 |
|--------------------------------------------------------------|
1. | 1 1 1 2008m1 2008m9 . . . |
2. | 1 2 0 2008m12 2009m8 4 9 9 |
3. | 1 3 1 2009m11 2010m9 . . . |
4. | 1 4 0 2010m10 2011m9 6 11 20 |
|--------------------------------------------------------------|
5. | 2 1 1 2007m4 2009m12 . . . |
6. | 2 2 1 2010m4 2011m4 . . . |
7. | 2 3 0 2011m6 2011m8 5 20 44 |
+--------------------------------------------------------------+
My (working) attempt:
The following code returns the desired result.
* expand each spell to one observation per time unit (here "months"; works also for days)
expand edate-bdate+1
bysort id spellnr: gen spell_date = bdate + _n - 1
format %tm spell_date
list, sepby(id spellnr)
* fill-in empty months (not covered by spells)
xtset id spell_date, monthly
tsfill
* compute cumulative time spent in employment and lagged values
bysort id (spell_date): gen cum_empl = sum(employed) if employed==1
bysort id (spell_date): replace cum_empl = cum_empl[_n-1] if cum_empl==.
bysort id (spell_date): gen lag_7 = L7.cum_empl if employed==0
bysort id (spell_date): gen lag_24 = L25.cum_empl if employed==0
bysort id (spell_date): gen lag_48 = L49.cum_empl if employed==0
qui replace lag_7=0 if lag_7==. & employed==0 // fix computation for first spell of each "id" (if not enough time to go back with "L.")
qui replace lag_24=0 if lag_24==. & employed==0
qui replace lag_48=0 if lag_48==. & employed==0
* compute time spent in employment in the last 6, 24, 48 months, at the beginning of each unemployment spell
bysort id (spell_date): gen m6 = cum_empl - lag_7 if employed==0
bysort id (spell_date): gen m24 = cum_empl - lag_24 if employed==0
bysort id (spell_date): gen m48 = cum_empl - lag_48 if employed==0
qui drop if (spellnr==.)
qui bysort id spellnr (spell_date): keep if _n == 1
drop spell_date cum_empl lag_*
list
This works fine, but becomes quite inefficient when using (several millions of) daily data. Can you suggest any alternative approach that does not involve expanding the dataset?
In words what I do above is:
I expand the data to have one row per month;
I fill in the gaps between the spells with -tsfill-;
I compute the running time spent in employment, and use lag operators to get the three quantities of interest.
This is in the vein of what was done in a past question that I posted. However, the working example there was unnecessarily complicated and contained some mistakes.
SOLUTIONS PERFORMANCE
I tried different approaches suggested in the accepted answer below (including using joinby as suggested in an earlier version of the answer). In order to create a larger dataset I used:
expand 500000
bysort id spellnr: gen new_id = _n
drop id
rename new_id id
which creates a dataset with 500,000 id's (for a total of 3,500,000 spells).
The first solution largely dominates the ones that use joinby or rangejoin (see also the comments to the accepted answer below).
The code below might save some running time.
bys id (employed): gen tag = _n if !employed
sum tag, meanonly
local maxtag = `r(max)'
foreach i in 6 24 48 {
gen m`i' = .
forval d = 1/`maxtag' {
by id: gen x = 1 + min(bdate[`d'],edate) - max(bdate[`d']-`i',bdate) if employed
egen y = total(x*(x>0)), by(id)
replace m`i' = y if tag == `d'
drop x y
}
}
sort id bdate
The same logic, combined with -rangejoin- (from SSC), may also be worth a try. Please provide some feedback after testing it with your (large) actual data.
preserve
keep if employed
replace employed = 0
tempfile em
save `em'
restore
foreach i in 6 24 48 {
gen _bd = bdate - `i'
rangejoin edate _bd bdate using `em', by(id employed) p(_)
egen m`i' = total(_edate - max(_bd,_bdate)+1) if !employed, by(id bdate)
bys id bdate: keep if _n==1
drop _*
}
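The overlap term that both snippets rely on can be checked by hand for one pair of spells. This is only an illustrative sketch: the local macro names are hypothetical, and the dates are those of id 1 in the example data (employment spell 1 and unemployment spell 2).

```stata
* employment spell of id 1: 2008m1 to 2008m9
local e_bdate = tm(2008m1)
local e_edate = tm(2008m9)
* unemployment spell of id 1 starts in 2008m12; look back 6 months
local u_bdate = tm(2008m12)
local window = 6
* months of the employment spell that fall inside the look-back window
local overlap = 1 + min(`u_bdate', `e_edate') - max(`u_bdate' - `window', `e_bdate')
display `overlap'
```

This displays 4, matching m6 for spell 2 of id 1 in the desired output. A value of zero or below means the employment spell lies outside the window, which is why the first snippet only sums x where x > 0.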

Generating panel data in Stata

How can I generate panel data in Stata?
I would like each individual to be affected by unobserved heterogeneity.
For example, I want the DGP (data generating process) to be something like:
Wages_{it}= \beta (Labor market experience_{it}) + \alpha_{i} + \epsilon_{it},
where \alpha_{i} is the unobserved heterogeneity and where \epsilon_{it} is the error term which is normally distributed.
Finally, I would like (Labor market experience_{it}) to follow an AR(1) process, e.g.:
Labor market experience_{it}= 0.8 * (Labor market experience_{i,t-1}) + v_{it},
where v_{it} is the error term which is normally distributed.
You can do something like this by using subscripting combined with bysort:
clear
set seed 10011979
set obs 4 // Set the number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)
expand 3 // Set the number of periods (T)
bys id: gen t=_n
xtset id t
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n==1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n!=1
gen w = 3 * lme + alpha + rnormal(0,1)
drop alpha
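One way to sanity-check a simulation like this is to generate a larger panel and see whether a fixed-effects regression recovers the slope. The sketch below reuses the same steps; the panel dimensions (500 individuals, 10 periods) are assumptions chosen only to make the estimate reasonably precise, and the true coefficient is 3 as in the code above.

```stata
clear
set seed 10011979
set obs 500                      // assumed number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)         // unobserved heterogeneity
expand 10                        // assumed number of periods (T)
bysort id: gen t = _n
xtset id t
* first period drawn directly, later periods follow the AR(1) recursion
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n == 1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n != 1
gen w = 3 * lme + alpha + rnormal(0,1)
* the within estimator removes alpha, so the slope on lme should be close to 3
xtreg w lme, fe
display _b[lme]
```

Because alpha is drawn independently of lme here, a random-effects or pooled regression would also be consistent in this particular setup; the fixed-effects estimator is used only because it matches the unobserved-heterogeneity framing of the question.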

How to use if for each variable in egen anycount

I have a large dataset where each observation represents a household; variables are either households characteristics (location, family name) or characteristics of household members, e.g. age_member1, age_member2, edu_member1, edu_member2 and many many more, for 50 members.
I would like to use anycount to find differences between migrants and non-migrants, e.g. whether the level of education differs (3 = university). This code finds how many people in the household have a university degree:
egen uni_member = anycount (edu_member*), values(3)
Now I would like to count only those who are migrants, maybe with an if condition:
egen uni_migrant = anycount (edu_member*) if migr_member*=1, values(3)
But this is wrong, because the if must refer to a single variable... any help?
I would advise using reshape to put the data in long form. Working rowwise is possible, but I usually find it more cumbersome. For example:
clear all
set more off
*----- example data -----
input ///
hh uni1 age1 migr1 uni2 age2 migr2 uni3 age3 migr3
1 1 23 0 0 54 1 0 38 1
2 0 16 0 1 48 1 0 40 0
end
list
*----- what you want -----
reshape long uni age migr, i(hh) j(member)
bysort hh: egen counthh = total(uni == 1 & migr == 1)
list, sepby(hh)
This shows that household 2 has one member who is both a migrant and has a university education, while household 1 has none. You can reshape back to a wide format if you need to. See help reshape.
If you insist on working rowwise you can start with Speaking Stata: Rowwise, by Nick Cox.
Following on Roberto Ferrer's answer this would seem to yield easily to a loop:
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) * (migr_member`j' == 1)
}
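Note that an if qualifier on replace leaves observations that do not satisfy the condition unchanged (it is generate with if that produces missing values), so the loop could also be written as
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) if migr_member`j' == 1
}
with the same result; putting the condition inside the expression just makes the counting explicit.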
An alternative is
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + cond(migr_member`j' == 1, (edu_member`j' == 3), 0)
}
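To see the loop at work, here is a small sketch on made-up data with two members per household instead of 50. The values are invented purely for illustration.

```stata
clear
input edu_member1 migr_member1 edu_member2 migr_member2
3 1 3 0
3 1 3 1
1 1 . 0
end
gen uni_migrant = 0
quietly forval j = 1/2 {
    * add 1 only when member j is a migrant with a university degree
    replace uni_migrant = uni_migrant + (edu_member`j' == 3) * (migr_member`j' == 1)
}
list
```

The three households get counts of 1, 2 and 0. The missing education value in the third row contributes nothing, because a missing value is not equal to 3.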