Stata using if condition with _n under by and egen commands

In Stata's auto data the following command creates all missing values: why?
bysort mpg: egen n1 = mean(price) if rep78[_n]!=rep78
For example take the 14 mpg group:
price mpg rep78
11385 14 3
14500 14 2
6303 14 4
12990 14 .
5379 14 4
13466 14 3
I expected that n1 for the first row would be mean(14500,6303,12990,5379). Basically, I want the mean after excluding the first and last rows, because for them rep78 has the same value as in the first row (3). Instead, I get all missing values.

The subscript [_n] is harmless but vacuous here as referring to the current observation. So the condition is just equivalent to rep78 != rep78 or rep78[_n] != rep78[_n] -- which is never true and so no observations satisfy the condition and the mean is returned as missing.
You're hoping or imagining that the prefix by: implies comparisons within a group, but at best that works only if subscripts are explicit and different.
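A minimal sketch of that point, using the auto data from the question. The variable name differs is hypothetical, and sorting by price within mpg is only an assumption made so that "first in group" is well defined.

```stata
sysuse auto, clear
* [_n] is the current observation, so this compares rep78 with itself
count if rep78[_n] != rep78
* an explicit, different subscript does compare within groups under by:
* (differs is a hypothetical name; rep78[1] is the first observation in each mpg group)
bysort mpg (price): gen byte differs = rep78 != rep78[1]
list mpg price rep78 differs if mpg == 14
```

The count should report no observations, while differs flags cars whose rep78 is not the same as that of the cheapest car with the same mpg.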
This works for your problem:
sysuse auto, clear
gen wanted = .
quietly forval i = 1/`=_N' {
su price if mpg == mpg[`i'] & rep78 != rep78[`i'], meanonly
replace wanted = r(mean) in `i'
}
There may be a way to do this with rangestat or rangerun from SSC, or otherwise, in which case a better solution may follow.
EDIT: The OP's code suggestion in comments
bysort mpg rep78: egen sum_m_r_price = sum(price)
bysort mpg rep78: egen count_m_r_price = count(price)
bysort mpg: egen sum_r_price = sum(price)
bysort mpg: egen count_r_price = count(price)
gen b_wanted = ( sum_r_price-sum_m_r_price)/ (count_r_price-count_m_r_price)
appears equivalent.
In turn, this should be faster than that (rangestat must be installed from SSC):
rangestat (sum) sum2=price (count) count2=price, i(rep78 0 0) by(mpg)
rangestat (sum) sum1=price (count) count1=price, i(mpg 0 0)
gen double wanted = (sum1 - sum2) / (count1 - count2)

Related

use xtile by year using weights

I have data with an income variable and a weight variable, and I want to calculate the 5% quantiles by year.
Is there a way to do that?
For the weight I can use regular xtile:
xtile quan = salary [aw=weight], n(20)
And for the years I can use xtile from egenmore:
egen quan = xtile(salary), by(year) nq(20)
But how can I do it for weights and by year together?
There is a weights() option, as stated in help egenmore:
clear
set more off
sysuse auto
keep mpg foreign weight
// egenmore
egen mpg4 = xtile(mpg), by(foreign) nq(4) weights(weight)
// compare with xtile
xtile mpg4_1 = mpg [aweight=weight] if foreign, nq(4)
xtile mpg4_2= mpg [aweight=weight] if !foreign, nq(4)
egen mpg42 = rowtotal(mpg4_1 mpg4_2)
assert mpg4 == mpg42
sort foreign mpg weight
list, sepby(foreign)
In the ado-file for egen's xtile function, you can check how weights are set:
if "`weights'" ~= "" {
local weight "[aw = `weights']"
}
See viewsource _gxtile.ado.

Refer to iteration number inside for-loop in Stata

Like many others, I often loop through variables in Stata, running some estimation command and then extracting the results to a variable created to hold them. This is simple when the variables are numbered sequentially or in some pattern (e.g. even numbers in a set). As an example:
sysuse auto
gen var1 = uniform()
gen var2 = uniform()
gen var3 = uniform()
*Create variables to hold results
gen str4 varname=""
gen results=.
*Loop through three variables
foreach i of numlist 1/3{
replace varname="var`i'" in `i'
sum var`i'
replace results=r(mean) in `i'
}
However, I often want to do something similar when the variables are not numeric and/or are not in an easy-to-handle order. Let's say I wanted to do the same thing for price, mpg, weight and length in the auto dataset. If we set up the for-loop as:
sysuse auto
gen str4 varname=""
gen results=.
foreach var of varlist price mpg weight length{
sum `var'
*Place values, in order, in rows?
}
then we need some way to understand that price is the first variable in the list, so its results should go in row 1 (or its name in row 1, or whatever we want to do).
This must be possible, but I would appreciate some suggestions. A clean/non-hackish way would be ideal, as I will be doing this a lot.
You can use a local counter that you start at 1 and increment at the end of each iteration:
sysuse auto, clear
gen varname=""
gen mean=.
local i=1
foreach var of varlist price mpg weight {
quietly sum `var'
replace mean = r(mean) in `i'
replace varname = "`var'" in `i'
local ++i
}
You could also do this. It's unlikely to seem as direct or simple as the standard technique explained by @Dimitriy V. Masterov, but it has its uses.
sysuse auto, clear
gen varname = ""
gen mean = .
local nvars : word count price mpg weight
tokenize "price mpg weight"
quietly forval j = 1/`nvars' {
sum ``j'', meanonly
replace mean = r(mean) in `j'
replace varname = "``j''" in `j'
}
The general points are:
Words are separated by spaces, except that double quotation marks and compound double quotation marks bind tighter. Thus a, b and c are unsurprisingly the words in a b c, but there are just two words in Stata "is great".
You can count how many objects you are looping over. It is the number of words in a string.
Applying tokenize to an argument string maps the separate words of that argument to local macros named 1, 2 and so forth. The nested macro references this is likely to require are interpreted just as you would guess from elementary algebra: the innermost reference is evaluated first.
For more complicated problems, including the unpacking of a varlist, check out also unab.
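As a rough sketch of these points together, the following unpacks a varlist with unab, counts its words, and walks through them with tokenize. The macro names vars, nvars and k are illustrative choices, not anything required by Stata.

```stata
sysuse auto, clear
* unab expands a varlist (here a range) into full variable names
unab vars : price-rep78
local nvars : word count `vars'
* tokenize stores each word in local macros 1, 2, ...
tokenize "`vars'"
forvalues j = 1/`nvars' {
    * the inner reference `j' is evaluated first, then the macro it names
    display "`j': ``j''"
}
* quotation marks bind tighter than spaces, so "is great" is one word
local k : word count Stata "is great"
display `k'
```

In the auto data the range price-rep78 covers price, mpg and rep78, so the loop runs three times.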

Create a variable by dividing a variable by IQR in Stata

How could I create a variable by dividing a variable by its IQR? I have done it the long way, as follows.
Sample data and code are the following:
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
foreach var of varlist read-socst {
egen `var'75 = pctile(`var'), p(75)
egen `var'25 = pctile(`var'), p(25)
gen `var'q =`var'75 - `var'25
drop `var'75 `var'25
}
gen readI = read/readq
gen sciI = science/scienceq
The simplest way is just to use summarize results directly:
sysuse auto, clear
quietly foreach v of var price-foreign {
su `v', detail
gen `v'q = `v' / (r(p75) - r(p25))
}
The egen route is overkill if it means creating new variables for each original variable, just to hold the quartiles or the IQR as repeated constants. But egen comes into its own when you want to do this by groups:
bysort foreign: egen mpg_upq = pctile(mpg), p(75)
by foreign: egen mpg_loq = pctile(mpg), p(25)
gen mpg_Q = mpg / (mpg_upq - mpg_loq)
Note that the IQR can be 0, and will often be 0 for indicator variables.
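One way to guard against a zero IQR is to check it before dividing. This is only a sketch: the indicator rare and the _iqr suffix are hypothetical names chosen for illustration, and skipping the variable is just one possible policy.

```stata
sysuse auto, clear
* hypothetical indicator that is 0 for most cars, so its IQR is 0
gen byte rare = price > 12000
quietly foreach v of var price mpg rare {
    su `v', detail
    local iqr = r(p75) - r(p25)
    if `iqr' > 0 {
        gen `v'_iqr = `v' / `iqr'
    }
    else {
        * dividing by 0 would only yield missing values, so skip instead
        noisily display "`v': IQR is 0, no scaled variable created"
    }
}
```

Without the check, Stata would not stop with an error; it would create a variable that is missing everywhere, which is easy to overlook.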

How to use if for each variable in egen anycount

I have a large dataset where each observation represents a household; variables are either households characteristics (location, family name) or characteristics of household members, e.g. age_member1, age_member2, edu_member1, edu_member2 and many many more, for 50 members.
I would like to use anycount to find differences between migrants and non-migrants, e.g. whether the level of education differs (3 = university). This code finds how many people in the household have a university degree:
egen uni_member = anycount (edu_member*), values(3)
Now I would like to count only those who are migrants, maybe with an if condition:
egen uni_migrant = anycount (edu_member*) if migr_member*=1, values(3)
But this is wrong, because the if must refer to a single variable... any help?
I would advise using reshape to put the data in long form. Working rowwise is possible, but I usually find it more cumbersome. For example:
clear all
set more off
*----- example data -----
input ///
hh uni1 age1 migr1 uni2 age2 migr2 uni3 age3 migr3
1 1 23 0 0 54 1 0 38 1
2 0 16 0 1 48 1 0 40 0
end
list
*----- what you want -----
reshape long uni age migr, i(hh) j(member)
bysort hh: egen counthh = total(uni == 1 & migr == 1)
list, sepby(hh)
This shows that household 2 has one member who is both a migrant and has university education, while household 1 has none. You can reshape back to a wide format if you need to. See help reshape.
If you insist on working rowwise you can start with Speaking Stata: Rowwise, by Nick Cox.
Following on Roberto Ferrer's answer this would seem to yield easily to a loop:
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) * (migr_member`j' == 1)
}
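Note that an if qualifier can do the same job here
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) if migr_member`j' == 1
}
because replace leaves observations that do not match the if condition unchanged. It is generate with an if qualifier that sets non-matching observations to missing, so the running total must be initialized without one, as above.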
An alternative is
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + cond(migr_member`j' == 1, (edu_member`j' == 3), 0)
}
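As a quick check of the loop approach, here is a sketch on the two-household example data from the reshape answer, where variables are named uni and migr, university is coded 1, and there are 3 members rather than 50. The name counthh is reused from that answer.

```stata
clear
input hh uni1 age1 migr1 uni2 age2 migr2 uni3 age3 migr3
1 1 23 0 0 54 1 0 38 1
2 0 16 0 1 48 1 0 40 0
end
* running total over members, staying in wide form
gen counthh = 0
forvalues j = 1/3 {
    quietly replace counthh = counthh + (uni`j' == 1 & migr`j' == 1)
}
list hh counthh
```

The counts agree with what the long-form egen total() calculation gives for each household.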

How to get average number of observations per group?

In my dataset, I have observations for football matches. One of my variables is hometeam. Now I want to get the average number of observations per hometeam. How do I do that in Stata?
I know that I could tab hometeam, but since there are over 500 distinct hometeams, I don't want to do the calculation manually.
bysort hometeam : gen n = _N
bysort hometeam : gen tag = _n == 1
su n if tag
EDIT: Another way to do it more concisely:
bysort hometeam : gen n = _N if _n == 1
su n
Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. egen, tag() does the same thing.
Why if _n == 1? You need this value just once for each group. Two choices always work, even for groups as small as one observation: use the first observation or the last observation in each group. In a group of one they are the same observation, but that doesn't matter. So if _n == _N is another way to do it.
bysort hometeam : gen n = _N if _n == _N
The code needs to be changed in situations where missings on some variable should not be counted:
bysort hometeam : gen n = sum(!missing(myvar))
by hometeam : replace n = . if _n < _N
egen, count() is similar, but not identical.
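A small sketch of the egen, tag() variant mentioned above. Since the football data are not available, the auto data are used, with foreign standing in for hometeam; that substitution is an assumption for illustration only.

```stata
sysuse auto, clear
* foreign stands in for hometeam (an assumption for illustration)
* tag() marks exactly one observation in each group with 1
egen tag = tag(foreign)
bysort foreign: gen n = _N
summarize n if tag, meanonly
display r(mean)
```

Note that tag() by default does not tag groups defined by missing values, so results can differ from the bysort approach when the grouping variable has missings.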
I assume you can identify the different hometeams with some id variable.
If you want the average number of observations per id this is one way:
clear all
set more off
input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end
list, sepby(id)
bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)
Note that observations with missings are not counted by count. If you did want to count the missings, then you could do:
bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)
Option 2: Using the data of @Roberto
collapse (count) hometeam, by(id)
sum hometeam, meanonly
disp r(mean)