I am working on a test score database and want to build two observations. The data has columns for English score, Math score, Rank and a unique id for each kid. Let's call the kid for whom we are building the observations the "focal kid".
Obs 1: average English score of all kids ranked below the focal kid whose Math score is above that of the focal kid
Obs 2: average English score of all kids ranked below the focal kid whose Math score is below that of the focal kid
Please help me write this code without loops if possible (I have about 100k observations).
Update 1: I am building these observations for each kid, not just one kid.
Loops!
* toy dataset
clear
set obs 5
set seed 2803
gen id = _n
gen rnd = runiform()
sort rnd
gen rank = _n
gen math = 100 * runiform()
gen english = 100 * runiform()
* code for real
gen math_above = .
gen math_below = .
sort rank
forval j = 2/`=_N' {
local J = `j' - 1
su english if math > math[`j'] in 1/`J', meanonly
replace math_above = r(mean) in `j'
su english if math < math[`j'] in 1/`J', meanonly
replace math_below = r(mean) in `j'
}
In Stata's auto data the following command creates all missing values: why?
bysort mpg: egen n1 = mean(price) if rep78[_n]!=rep78
For example take the 14 mpg group:
price mpg rep78
11385 14 3
14500 14 2
6303 14 4
12990 14
5379 14 4
13466 14 3
I expected that n1 for the first row will be mean(14500,6303,12990,5379). Basically I want the mean after excluding the first and last rows because for them we have rep78[_n]==rep78 (equals 3). But instead, I get all missing values.
The subscript [_n] is harmless but vacuous here, as it just refers to the current observation. So the condition is equivalent to rep78 != rep78 or rep78[_n] != rep78[_n], which is never true. No observations satisfy the condition, and so the mean is returned as missing.
You're hoping or imagining that the prefix by: implies comparisons within a group, but at best that works only if subscripts are explicit and different.
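A minimal sketch of the point about subscripts, using the auto data. The variable name differs is made up for illustration; the first command counts observations satisfying the original condition, and the second uses an explicit, different subscript to compare within groups.

```stata
sysuse auto, clear
* rep78[_n] is the same thing as rep78, so this condition can never be true
count if rep78[_n] != rep78
* with an explicit, different subscript, by: does compare neighbours within a group
bysort mpg (rep78): gen byte differs = rep78 != rep78[_n-1] if _n > 1
tabulate differs, missing
```

The count should be zero, which is why egen had nothing to average.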
This works for your problem:
sysuse auto, clear
gen wanted = .
quietly forval i = 1/`=_N' {
su price if mpg == mpg[`i'] & rep78 != rep78[`i'], meanonly
replace wanted = r(mean) in `i'
}
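As a rough check, the loop can be rerun and the 14 mpg group from the question listed. If that group is exactly as shown in the question, wanted for the two cars with rep78 equal to 3 should be the mean of the other four prices.

```stata
sysuse auto, clear
gen wanted = .
quietly forval i = 1/`=_N' {
    su price if mpg == mpg[`i'] & rep78 != rep78[`i'], meanonly
    replace wanted = r(mean) in `i'
}
* inspect the group discussed in the question
list price mpg rep78 wanted if mpg == 14
* direct calculation for cars with rep78 == 3 in that group
su price if mpg == 14 & rep78 != 3, meanonly
display r(mean)
```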
There may be a way to do this with rangestat or rangerun from SSC, or otherwise, in which case a better solution may follow.
EDIT: The OP's code suggestion in comments
bysort mpg rep78: egen sum_m_r_price = sum(price)
bysort mpg rep78: egen count_m_r_price = count(price)
bysort mpg: egen sum_r_price = sum(price)
bysort mpg: egen count_r_price = count(price)
gen b_wanted = ( sum_r_price-sum_m_r_price)/ (count_r_price-count_m_r_price)
appears equivalent.
In reverse, this should be faster than that:
rangestat (sum) sum2=price (count) count2=price, i(rep78 0 0) by(mpg)
rangestat (sum) sum1=price (count) count1=price, i(mpg 0 0)
gen double wanted = (sum1 - sum2) / (count1 - count2)
I make a lot of graphs comparing two groups (e.g., male/female) across a number of variables. The standard -graph bar- output groups all bars for men together, and all bars for women together. I am hoping to find a simple way to make bar graphs that group bars first by the target variable (i.e. the variables being graphed), and then by the -over- variable, such as gender.
I have a method for doing this, but it is quite cumbersome. See illustration below.
*Set seed + obs
clear
set seed 442
set obs 100
*Generate two outcomes
gen x1 = uniform()
gen x2 = uniform()
*Generate crossing variable
gen gender = 0 in 1/50
replace gender = 1 in 51/100
label define gender_lab 0 "Male" 1 "Female"
label values gender gender_lab
*Extract means by gender
gen b_male = .
gen b_female = .
sum x1 if gender == 0
replace b_male = r(mean) in 1
sum x1 if gender == 1
replace b_female = r(mean) in 1
sum x2 if gender == 0
replace b_male = r(mean) in 2
sum x2 if gender == 1
replace b_female = r(mean) in 2
*Establish order of graph
gen index_male = _n*3 in 1/2
gen index_female = (_n*3) + 1 in 1/2
*This is what -graph bar- produces naturally
graph bar x1 x2, over(gender)
*This is closer to what I want
twoway bar b_male index_male || bar b_female index_female, xlabel(3.5 "x1" 6.5 "x2", notick labgap(4)) xmlabel(3 "Male" 4 "Female" 6 "Male" 7 "Female") legend(off)
Is there a simple way to use graph bar but still establish the sort order I want? I produce dozens of these graphs per day sometimes, so I want to avoid unnecessary steps as much as possible.
This is a model question: thank you very much!
I'll first copy your code, with some small simplifications which may be of interest anyway.
*Set seed + obs
clear
set seed 442
set obs 100
*Generate two outcomes
gen x1 = runiform()
gen x2 = runiform()
*Generate crossing variable
gen gender = _n > 50
label define gender_lab 0 "Male" 1 "Female"
label values gender gender_lab
*Extract means by gender
sum x1 if gender == 0
gen b_male = r(mean) in 1
sum x1 if gender == 1
gen b_female = r(mean) in 1
sum x2 if gender == 0
replace b_male = r(mean) in 2
sum x2 if gender == 1
replace b_female = r(mean) in 2
*Establish order of graph
gen index_male = _n*3 in 1/2
gen index_female = (_n*3) + 1 in 1/2
*This is what -graph bar- produces naturally
graph bar x1 x2, over(gender) name(G1)
*This is closer to what I want
twoway bar b_male index_male || bar b_female index_female, ///
xlabel(3.5 "x1" 6.5 "x2", notick labgap(4)) ///
xmlabel(3 "Male" 4 "Female" 6 "Male" 7 "Female") legend(off) name(G2)
The good news is that there is a one-line solution once you have installed statplot by Eric A. Booth and myself from SSC. (The email address for Eric in the help file is no longer current.)
ssc inst statplot
statplot x1 x2, over(gender)
statplot x1 x2, over(gender) recast(bar)
statplot x1 x2, over(gender) recast(bar) asyvars yla(, ang(h)) ///
bar(2, bcolor(orange*0.8)) bar(1, bcolor(blue*0.8))
Here is the last graph to show what is done.
statplot defaults to means, which is what you show, so you don't have to calculate means. Other statistics are available.
How can I generate panel data in Stata?
I would like each individual to be affected by unobserved heterogeneity.
For example, I want the DGP (data generating process) to be something like:
Wages_{it}= \beta (Labor market experience_{it}) + \alpha_{i} + \epsilon_{it},
where \alpha_{i} is the unobserved heterogeneity and where \epsilon_{it} is the error term which is normally distributed.
Finally, I would like (Labor market experience_{it}) to be an AR(1) process, e.g.:
Labor market experience_{it}= 0.8 * (Labor market experience_{i,t-1}) + v_{it},
where v_{it} is the error term which is normally distributed.
You can do something like this by using subscripting combined with bysort:
clear
set seed 10011979
set obs 4 // Set the number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)
expand 3 // Set the number of periods (T)
bys id: gen t=_n
xtset id t
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n==1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n!=1
gen w = 3 * lme + alpha + rnormal(0,1)
drop alpha
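One way to check the simulated data is to scale the same code up and fit a fixed-effects regression. This is only a sketch: the panel dimensions (500 panels, 10 periods) are arbitrary choices for illustration, and the coefficient on lme should come out near the 3 used in the DGP rather than exactly equal to it.

```stata
clear
set seed 10011979
set obs 500                 // number of panels, chosen arbitrarily here
gen id = _n
gen alpha = rnormal(0,1)    // unobserved heterogeneity, constant within id
expand 10                   // number of periods, chosen arbitrarily here
bysort id: gen t = _n
xtset id t
* AR(1) process: the first period is initialised, later periods recurse
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n == 1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n != 1
gen w = 3 * lme + alpha + rnormal(0,1)
* the fixed-effects estimator sweeps out alpha
xtreg w lme, fe
```

The recursion works because replace proceeds through the observations in order, so lme[_n-1] is already filled in when each later period is computed.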
I have a large dataset where each observation represents a household; variables are either households characteristics (location, family name) or characteristics of household members, e.g. age_member1, age_member2, edu_member1, edu_member2 and many many more, for 50 members.
I would like to use anycount to find differences between migrants and non-migrants, e.g. whether the level of education differs (3 = university). This code finds how many people in the household have a university degree:
egen uni_member = anycount (edu_member*), values(3)
Now I would like to count only those who are migrants, maybe with an if condition:
egen uni_migrant = anycount (edu_member*) if migr_member*=1, values(3)
But this is wrong, because the if must refer to a single variable... any help?
I would advise using reshape to put the data in long form. Working rowwise is possible, but I usually find it more cumbersome. For example:
clear all
set more off
*----- example data -----
input ///
hh uni1 age1 migr1 uni2 age2 migr2 uni3 age3 migr3
1 1 23 0 0 54 1 0 38 1
2 0 16 0 1 48 1 0 40 0
end
list
*----- what you want -----
reshape long uni age migr, i(hh) j(member)
bysort hh: egen counthh = total(uni == 1 & migr == 1)
list, sepby(hh)
This shows that household 2 has one member who is both a migrant and has university education, while household 1 has none. You can reshape back to a wide format if you need to. See help reshape.
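For the step back to wide form, a sketch with the same example data follows. Because counthh is constant within each household, reshape wide carries it along as a household-level variable.

```stata
clear
input hh uni1 age1 migr1 uni2 age2 migr2 uni3 age3 migr3
1 1 23 0 0 54 1 0 38 1
2 0 16 0 1 48 1 0 40 0
end
reshape long uni age migr, i(hh) j(member)
bysort hh: egen counthh = total(uni == 1 & migr == 1)
* back to one observation per household; counthh does not vary within hh
reshape wide uni age migr, i(hh) j(member)
list hh counthh
```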
If you insist on working rowwise you can start with Speaking Stata: Rowwise, by Nick Cox.
Following on Roberto Ferrer's answer this would seem to yield easily to a loop:
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) * (migr_member`j' == 1)
}
Note that an if qualifier would also do the job here:
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) if migr_member`j' == 1
}
as replace leaves observations not matching the if condition unchanged. The same is not true of generate, which would set the new variable to missing in observations not matching the if condition.
An alternative is
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + cond(migr_member`j' == 1, (edu_member`j' == 3), 0)
}
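A toy check of the loop, with made-up data for two members rather than 50. The data values and the names check1 and check2 are invented for illustration; the two variables hold the product version and the cond() version, which should agree.

```stata
clear
input edu_member1 migr_member1 edu_member2 migr_member2
3 1 3 0
3 1 3 1
2 1 3 0
end
gen check1 = 0
gen check2 = 0
quietly forval j = 1/2 {
    * product of two indicators: 1 only if both conditions hold
    replace check1 = check1 + (edu_member`j' == 3) * (migr_member`j' == 1)
    * cond() version: add the education indicator only for migrants
    replace check2 = check2 + cond(migr_member`j' == 1, (edu_member`j' == 3), 0)
}
list
assert check1 == check2
```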
In my dataset, I have observations for football matches. One of my variables is hometeam. Now I want to get the average number of observations per hometeam. How do I do that in Stata?
I know that I could tab hometeam, but since there are over 500 distinct hometeams, I don't want to do the calculation manually.
bysort hometeam : gen n = _N
bysort hometeam : gen tag = _n == 1
su n if tag
EDIT: Another way to do it more concisely
bysort hometeam : gen n = _N if _n == 1
su n
Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. egen, tag() does the same thing.
Why if _n == 1? You need to have this value just once for each group, and there are two ways of doing it that always work, even for groups as small as one observation: do it for the first or for the last observation in a group. In a group of 1, they are the same, but that doesn't matter. So if _n == _N is another way to do it.
bysort hometeam : gen n = _N if _n == _N
The code needs to be changed in situations where you must not count missing values on some variable
bysort hometeam : gen n = sum(!missing(myvar))
by hometeam : replace n = . if _n < _N
egen, count() is similar, but not identical.
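A small sketch of the tagging approach and of the egen remark, on made-up data (the team names and myvar values are invented). It uses egen's tag() function for the tag and egen's count() function for the nonmissing count, which is stored in every observation of a group rather than just once.

```stata
clear
input str1 hometeam myvar
"A" 1
"A" .
"B" 5
"C" 2
"C" 3
"C" .
end
* group size, used once per group via a tag
bysort hometeam: gen n = _N
egen tag = tag(hometeam)
su n if tag, meanonly
display r(mean)
* nonmissing count of myvar, repeated in every observation of the group
bysort hometeam: egen nonmiss = count(myvar)
list, sepby(hometeam)
```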
I assume you can identify the different hometeams with some id variable.
If you want the average number of observations per id this is one way:
clear all
set more off
input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end
list, sepby(id)
bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)
Note that observations with missings are not counted by count. If you did want to count the missings, then you could do:
bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)
Option 2: Using the data of @Roberto
collapse (count) hometeam, by(id)
summarize hometeam, meanonly
display r(mean)