properties of households from individual data - stata

I want to create a new variable HHage holding the age of the head of household, for each household identified by HID. In the dataset, the head of household is coded as P1. The dataset looks like this:
Personid HID Age
P1 100 12
P2 100 45
P1 101 16
P1 102 35
P2 102 24
P3 102 26
I tried the egen command but I get an error pertaining to numlist. The command I used was:
egen hhage = anyvalue(age), values(integer 1,2 to 26)

// create the example data
clear
input ///
str2 Personid HID Age
P1 100 12
P2 100 45
P1 101 16
P1 102 35
P2 102 24
P3 102 26
end
// check whether there is only 1 household head per household
bys HID : gen byte flag = -(Personid == "P1")
bys HID (flag): replace flag = sum(flag)
assert flag == -1
drop flag
// create hhage
gen hhage = Age if Personid == "P1"
bys HID (hhage): replace hhage = sum(hhage)
list , sepby(HID)

The excellent answer from @Maarten Buis explains that you can do this without egen. This answer focuses on using egen for this kind of problem.
What is allowed as a numlist is a minor issue here; the major issue is that the egen function anyvalue() is of little help. Its documentation explains that
anyvalue(varname), values(integer numlist) may not be combined with by. It takes the value of varname if varname is equal to any integer value in a supplied numlist and is missing otherwise.
This would be legal syntax
egen hhage = anyvalue(age), values(1/26)
but Stata would copy ages 1 to 26 to the new variable and ignore the others, observation by observation, regardless of household and who is head of household. That is not what you want.
One egen solution for this might be
egen hhage = total(age * (Personid == "P1")), by(HID)
The expression Personid == "P1" evaluates to 1 when true and 0 when false. So the age of the household head appears in the total and other values of age are ignored in so far as they contribute 0 to the total.
The by() option is undocumented but will work. Stata encourages you to do this instead:
bysort HID : egen hhage = total(age * (Personid == "P1"))
This solution assumes that
Personid is a string variable. If it is a numeric variable, the expression Personid == "P1" should be replaced by something like Personid == 1 using 1 or whatever other integer code is appropriate.
There is one head of household per household. That can be checked directly by something like
egen hhcount = total(Personid == "P1"), by(HID)
See also http://www.stata-journal.com/article.html?article=dm0055 for a review of technique in this territory.
Note that in principle you could do something like
egen work = anyvalue(age) if Personid == "P1", values(0/200)
allowing any age imaginable so long as the person is head of household. Then you could fix that by
egen hhage = total(work), by(HID)
However, I can see no point in that solution.
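As a sketch of the egen approach described above, the following applies it to the example data from the question. It assumes the variable names as they appear in the listing (Personid, HID, Age) and that Personid is a string; nheads and hhage2 are illustrative names. The max(cond()) line is an alternative not used above, which would return missing rather than 0 for a household with no recorded head.

```stata
clear
input str2 Personid HID Age
P1 100 12
P2 100 45
P1 101 16
P1 102 35
P2 102 24
P3 102 26
end

* check the assumption: exactly one head per household
egen nheads = total(Personid == "P1"), by(HID)
assert nheads == 1

* age of head, copied to every member of the household
egen hhage = total(Age * (Personid == "P1")), by(HID)

* alternative: missing, not 0, if a household has no head
egen hhage2 = max(cond(Personid == "P1", Age, .)), by(HID)

list, sepby(HID)
```

If the assert fails, the total() result is the sum of several heads' ages (or 0), so it is worth running the check before trusting hhage.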

Related

Stata using if condition with _n under by and egen commands

In Stata's auto data the following command creates all missing values: why?
bysort mpg: egen n1 = mean(price) if rep78[_n]!=rep78
For example take the 14 mpg group:
price mpg rep78
11385 14 3
14500 14 2
6303 14 4
12990 14
5379 14 4
13466 14 3
I expected that n1 for the first row will be mean(14500,6303,12990,5379). Basically I want the mean after excluding the first and last rows because for them we have rep78[_n]==rep78 (equals 3). But instead, I get all missing values.
The subscript [_n] is harmless but vacuous here as referring to the current observation. So the condition is just equivalent to rep78 != rep78 or rep78[_n] != rep78[_n] -- which is never true and so no observations satisfy the condition and the mean is returned as missing.
You're hoping or imagining that the prefix by: implies comparisons within a group, but at best that works only if subscripts are explicit and different.
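A small sketch of that point, using the auto data from the question (the variable name differs is illustrative): the condition with [_n] selects no observations at all, whereas an explicit, different subscript such as [1] does make a comparison within each by-group.

```stata
sysuse auto, clear

* rep78[_n] is just rep78 in the current observation,
* so this condition is never true
count if rep78[_n] != rep78
assert r(N) == 0

* an explicit, different subscript does compare within groups:
* each observation is compared with the first in its mpg group
bysort mpg : gen byte differs = rep78 != rep78[1]
```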
This works for your problem:
sysuse auto, clear
gen wanted = .
quietly forval i = 1/`=_N' {
su price if mpg == mpg[`i'] & rep78 != rep78[`i'], meanonly
replace wanted = r(mean) in `i'
}
There may be a way to do this with rangestat or rangerun from SSC, or otherwise, in which case a better solution may follow.
EDIT: The OP's code suggestion in comments
bysort mpg rep78: egen sum_m_r_price = sum(price)
bysort mpg rep78: egen count_m_r_price = count(price)
bysort mpg: egen sum_r_price = sum(price)
bysort mpg: egen count_r_price = count(price)
gen b_wanted = ( sum_r_price-sum_m_r_price)/ (count_r_price-count_m_r_price)
appears equivalent.
In reverse, this should be faster than that:
rangestat (sum) sum2=price (count) count2=price, i(rep78 0 0) by(mpg)
rangestat (sum) sum1=price (count) count1=price, i(mpg 0 0)
gen double wanted = (sum1 - sum2) / (count1 - count2)
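The equivalence between the explicit loop and the OP's sums-and-counts code can be checked directly. This is a sketch under two assumptions: the shortened variable names are illustrative, and total() is used as the documented name of the egen function that the OP's code calls sum().

```stata
sysuse auto, clear

* sums and counts by (mpg, rep78) and by mpg, as in the OP's code
bysort mpg rep78 : egen sum_mr = total(price)
bysort mpg rep78 : egen n_mr   = count(price)
bysort mpg       : egen sum_m  = total(price)
bysort mpg       : egen n_m    = count(price)
gen b_wanted = (sum_m - sum_mr) / (n_m - n_mr)

* the explicit loop over observations
gen wanted = .
quietly forval i = 1/`=_N' {
    su price if mpg == mpg[`i'] & rep78 != rep78[`i'], meanonly
    replace wanted = r(mean) in `i'
}

* same pattern of missings, and same values up to rounding
assert missing(wanted) == missing(b_wanted)
assert reldif(wanted, b_wanted) < 1e-6 if !missing(wanted)
```

Both versions return missing when every car in an mpg group shares the same rep78, since there is then nothing left to average.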

Stata Generate New Variable List By Multiplying Var Lists

I have a balanced panel with a set of dummies for 'countries' and observations for several years. I want to generate a new set of variables that assigns a number in the sequence 1:n for each year observation of country i, and 0 for any other observation that is not from country i.
As an example, suppose I have two countries and two years. Below on the left is an example of my database. I want a new set of variables as shown on the right:
*   Example of database        |  Desired output
*   country1  country2  year   |  output1  output2
*      1         0        1    |     1        0
*      1         0        2    |     2        0
*      0         1        1    |     0        1
*      0         1        2    |     0        2
How can I get the desired output? Intuitively I need to multiply 'country*' by 'year' to get 'output*', but I have been unable to make it work in Stata.
Below is what I tried.
gen output = year * country
* country is ambiguous
gen output = year * country*
* invalid syntax
foreach var in country*{
gen output_`var' = year * `var'
}
* invalid name
Your last attempt almost solved it. The issue with your attempt is that you need to tell Stata that you are passing a varlist for you to be able to use the wildcards * and ?. To be able to use a wildcard in foreach, do this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end
foreach var of varlist country* {
gen `var'_year = year * `var'
}
The full name country1, country2 etc. is stored in `var', so I took the liberty of naming the result variables country1_year, country2_year etc. rather than output_country1, output_country2 etc.
Note that this solution will only work if the country* variables take only the values 1 and 0, no observation has a missing value in any country* variable, and no observation has the value 1 in more than one country* variable.
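Those conditions can be checked before generating the new variables. The following is a minimal sketch on the example data; ncountries is an illustrative name, and the checks are one possible way to test the stated assumptions.

```stata
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end

* each dummy must be exactly 0 or 1 (this also rules out missings)
foreach var of varlist country* {
    assert inlist(`var', 0, 1)
}

* no observation may belong to more than one country
egen byte ncountries = rowtotal(country*)
assert ncountries <= 1

foreach var of varlist country* {
    gen `var'_year = year * `var'
}
list
```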

How to use if for each variable in egen anycount

I have a large dataset where each observation represents a household; variables are either households characteristics (location, family name) or characteristics of household members, e.g. age_member1, age_member2, edu_member1, edu_member2 and many many more, for 50 members.
I would like to use anycount to find differences between migrants and non-migrants, e.g. whether the level of education differs (3 = university). This code finds how many people in the household have a university degree:
egen uni_member = anycount (edu_member*), values(3)
Now I would like to count only those who are migrants, maybe with a if condition:
egen uni_migrant = anycount (edu_member*) if migr_member*=1, values(3)
But this is wrong, because the if must refer to a single variable... any help?
I would advise using reshape to put the data in long form. Working rowwise is possible, but I usually find it more cumbersome. For example:
clear all
set more off
*----- example data -----
input ///
hh uni1 age1 migr1 uni2 age2 migr2 uni3 age3 migr3
1 1 23 0 0 54 1 0 38 1
2 0 16 0 1 48 1 0 40 0
end
list
*----- what you want -----
reshape long uni age migr, i(hh) j(member)
bysort hh: egen counthh = total(uni == 1 & migr == 1)
list, sepby(hh)
This shows that household 2 has one member who is both a migrant and university-educated, while household 1 has none. You can reshape back to a wide format if you need to. See help reshape.
If you insist on working rowwise you can start with Speaking Stata: Rowwise, by Nick Cox.
Following on Roberto Ferrer's answer this would seem to yield easily to a loop:
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) * (migr_member`j' == 1)
}
Note that a version using an if qualifier,
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + (edu_member`j' == 3) if migr_member`j' == 1
}
gives the same result, because replace leaves observations that do not match the if condition unchanged. It is generate with an if qualifier that sets non-matching observations to missing.
An alternative is
gen uni_migrant = 0
qui forval j = 1/50 {
replace uni_migrant = uni_migrant + cond(migr_member`j' == 1, (edu_member`j' == 3), 0)
}
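As a sketch, the product form and the cond() form can be compared on a small made-up dataset. The data here are hypothetical, with only two members per household instead of 50, and uni_migrant2 is an illustrative name.

```stata
clear
input hh edu_member1 migr_member1 edu_member2 migr_member2
1 3 1 3 0
2 3 1 3 1
3 1 1 . 0
end

gen uni_migrant = 0
gen uni_migrant2 = 0
quietly forval j = 1/2 {
    * product of two indicators: 1 only if both conditions hold
    replace uni_migrant = uni_migrant + (edu_member`j' == 3) * (migr_member`j' == 1)
    * same logic written with cond()
    replace uni_migrant2 = uni_migrant2 + cond(migr_member`j' == 1, edu_member`j' == 3, 0)
}
assert uni_migrant == uni_migrant2
list hh uni_migrant
```

Missing values of edu_member or migr_member simply contribute 0 here, because a comparison such as edu_member2 == 3 evaluates to 0 when edu_member2 is missing.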

Stata: First occurrences, sum of unique occurrences with a by variable

The following sample data has variables describing bets by a number of players.
How can I calculate each player's first bettype, first betprice, the number of soccer bets, the number of baseball bets, the number of unique prices per customer and the number of unique bet types per username?
clear
input str16 username str40 betdate stake str16 bettype betprice str16 sport
player1 "12NOV2008 12:04:33" 90 SGL 5 SOCCER
player1 "04NOV2008:09:03:44" 30 SGL 4 SOCCER
player2 "07NOV2008:14:03:33" 120 SGL 5 SOCCER
player1 "05NOV2008:09:00:00" 50 SGL 4 SOCCER
player1 "05NOV2008:09:05:00" 30 DBL 3 BASEBALL
player1 "05NOV2008:09:00:05" 20 DBL 4 BASEBALL
player2 "09NOV2008:10:05:10" 10 DBL 5 BASEBALL
player2 "15NOV2008:15:05:33" 35 DBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 TBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 SGL 4 BASEBALL
end
generate double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
generate double dateonly=date(betdate,"DMY hms")
format dateonly %td
generate firsttype
generate firstprice
generate soccercount
generate baseballcount
generate uniquebettypecount
generate uniquebetpricecount
This is a bit close to the margin, as a "please give me the code" question, with no attempt at your own solutions.
The first type and price are
bysort username (timestamp) : gen firsttype = bettype[1]
bysort username (timestamp) : gen firstprice = betprice[1]
The number of soccer and baseball bets is
egen soccercount = total(sport == "SOCCER"), by(username)
egen baseballcount = total(sport == "BASEBALL"), by(username)
The number of distinct [not unique!] bet types is
bysort username bettype : gen work = _n == 1
egen uniquebettypecount = total(work), by(username)
and the other problem, the number of distinct prices, is handled in just the same way (but reuse work with replace, or drop it first). Another way to do that is
egen work = tag(username bettype)
egen uniquebettypecount = total(work), by(username)
What is characteristic of all these variables is that the same value is repeated for all observations within each group. For example, firsttype has the same value for each occurrence of each distinct username. Often you will want to use each value just once. A key to that is the egen function tag() just used, for example
egen usertag = tag(username)
followed by uses of if usertag when needed. (if usertag is a useful idiom for if usertag == 1.)
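A minimal sketch of the distinct-price count together with the tag idiom, on a cut-down and partly made-up version of the data (only the variables needed here are kept):

```stata
clear
input str8 username str3 bettype betprice
player1 SGL 5
player1 SGL 4
player1 DBL 4
player2 SGL 5
player2 DBL 5
end

* tag one observation per distinct (username, betprice) pair
egen work = tag(username betprice)
egen uniquebetpricecount = total(work), by(username)

* tag one observation per user, then show each user just once
egen usertag = tag(username)
list username uniquebetpricecount if usertag
```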
Some reading suggestions:
On by: http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
On egen: http://www.stata.com/help.cgi?egen
On distinct observations (and why the word "unique" is misleading): http://www.stata-journal.com/sjpdf.html?articlenum=dm0042

How to get average number of observations per group?

In my dataset, I have observations for football matches. One of my variables is hometeam. Now I want to get the average number of observations per hometeam. How do I do that in Stata?
I know that I could tab hometeam, but since there are over 500 distinct hometeams, I don't want to do the calculation manually.
bysort hometeam : gen n = _N
bysort hometeam : gen tag = _n == 1
su n if tag
EDIT Another way to do it more concisely
bysort hometeam : gen n = _N if _n == 1
su n
Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. egen, tag() does the same thing.
Why if _n == 1? You need to have this value just once for each group, and there are two ways of doing it that always work for groups that could be as small as one observation, to do it for the first or the last observation in a group. In a group of 1, they are the same, but that doesn't matter. So if _n == _N is another way to do it.
bysort hometeam : gen n = _N if _n == _N
The code needs to be changed in situations where you need not to count missings on some variable
bysort hometeam : gen n = sum(!missing(myvar))
by hometeam : replace n = . if _n < _N
egen, count() is similar, but not identical.
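A minimal sketch of the concise version on made-up data (three teams with 3, 1 and 2 matches), together with a cross-check that the result equals the total number of observations divided by the number of groups:

```stata
clear
input str1 hometeam
A
A
A
B
C
C
end

* group size, stored once per group
bysort hometeam : gen n = _N if _n == 1
summarize n, meanonly
display r(mean)

* cross-check: observations divided by number of distinct teams
egen tag = tag(hometeam)
count if tag
display _N / r(N)
```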
I assume you can identify the different hometeams with some id variable.
If you want the average number of observations per id this is one way:
clear all
set more off
input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end
list, sepby(id)
bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)
Note that observations with missings are not counted by count. If you did want to count the missings, then you could do:
bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)
Option 2: Using the data of @Roberto
collapse (count) hometeam, by(id)
summarize hometeam, meanonly
display r(mean)