Generating income of other members in family - stata

I have a dataset like this.
FamilyID Status personID spouseID HeadID spouse_of_referenceID income
1 Head 1 2 1 2
1 Spouse of head 2 1 1 2
1 Child 3 NA 1 2
2 Head 1 3 1 3
2 Spouse of head 3 1 1 3
For every "child" I want to create a variable "parents' income" which is sum of the income of the head and the income of spouse of head.
I am thinking of something like
bysort family: egen parentsincome = if ??? status==4
because status is 4 if the person is a child.
But I am not sure how to proceed. I thought about using _n, but I couldn't think of a real solution.

That is a weak data example: 7 variables are declared, but only 6 are exemplified, and it is not given as Stata input. "NA" isn't a Stata code for missing. Some engineering was needed to make sense of it. Statalist has advice on preparing data examples that applies here too (see its advice on Stata data examples).
You can just get the totals conditional on a person being head or their spouse directly with egen.
clear
input FamilyID str14 Status personID spouseID HeadID spouse_of_referenceID income
1 "Head" 1 2 1 2 1000
1 "Spouse of head" 2 1 1 2 2000
1 "Child" 3 . 1 2 0
2 "Head" 1 3 1 3 3000
2 "Spouse of head" 3 1 1 3 4000
end
egen HSIncome = total(income / inlist(Status, "Head", "Spouse of head")), by(FamilyID )
list FamilyID Status personID income HSIncome, sepby(FamilyID)
     +----------------------------------------------------------+
     | FamilyID           Status   personID   income   HSIncome |
     |----------------------------------------------------------|
  1. |        1             Head          1     1000       3000 |
  2. |        1   Spouse of head          2     2000       3000 |
  3. |        1            Child          3        0       3000 |
     |----------------------------------------------------------|
  4. |        2             Head          1     3000       7000 |
  5. |        2   Spouse of head          3     4000       7000 |
     +----------------------------------------------------------+
See, e.g., Sections 9 and 10 of this paper for a review of the technique.
If you're using value labels instead to show status, the code will naturally be different.
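If status is instead a numeric variable with value labels, one possible version looks like the sketch below. The code 4 for a child comes from the question; the codes 1 and 2 for head and spouse, the variable name status and the label name statuslbl are assumptions made only for illustration.

```stata
clear
input FamilyID status income
1 1 1000
1 2 2000
1 4 0
2 1 3000
2 2 4000
end
* Assumed coding: 1 = head, 2 = spouse of head; 4 = child as stated in the question
label define statuslbl 1 "Head" 2 "Spouse of head" 4 "Child"
label values status statuslbl
* Dividing by a condition gives missing wherever the condition is false,
* and total() ignores missing values
egen HSIncome = total(income / inlist(status, 1, 2)), by(FamilyID)
* Keep the family total only for children
gen parentsincome = HSIncome if status == 4
list, sepby(FamilyID)
```

The family total is computed for every member first, and the restriction to children is applied in a separate step, so the egen call itself never needs to know who is a child.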
The help for egen is explicit that you shouldn't try to use _n in conjunction. This is because egen often sorts the data temporarily so observations may change their order in the dataset.

Related

Computing Unemployment rates by education group from an indicator variable (Stata)

I have the following variable indicating whether an observation is working or unemployed, where 0 indicates working and 1 refers to unemployed.
dataex unemp
input float unemp
0
0
0
0
1
.
1
When I tabulate the variable:
Unemploymen |
          t |      Freq.
------------+-----------
   Employed |         80
 Unemployed |         20
   Total LF |        100
I essentially want to divide 20/100, to obtain a total unemployment variable of 20%. I have done this manually now, but think it is better to automate this as I also want to compute unemployment by different education groups and geographic regions.
gen unemployment_broad = .
replace unemployment_broad = (20/100)*100
The education variable is as follows, where 1 "Less than basic",
2 "Basic",
3 "Secondary",
4 "Higher education",
Is there a way to compute unemployment rate by each education group?
input float educ
2
4
4
4
2
4
1
3
3
3
Using Cybernike's solution, I tried to create a variable showing unemployment by education as follows, but I got an error:
gen unemp_educ = .
replace unemp_educ = bysort educ: summarize unemp
I essentially want to visualize unemployment by education. With something like this:
graph hbar (mean) Unemployment, over(education)
This is because I also intend to replicate the same equation by demographic group, gender, etc.
Your unemployment variable is coded as 0/1. Therefore, you can obtain the proportion unemployed by taking the mean value. You could do this using the summarize command, or using the collapse command. Both of these can be performed by education group.
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end
bysort educ: summarize unemp
collapse (mean) unemp, by(educ)
list
     +-----------------+
     | educ      unemp |
     |-----------------|
  1. |    1          1 |
  2. |    2         .5 |
  3. |    3   .6666667 |
  4. |    4          0 |
     +-----------------+
In response to your edit, you can also save the mean values to the original dataset using:
bysort educ: egen unemp_mean = mean(unemp)
Your code for plotting the data seems to work fine.
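To report the rate as a percentage and keep the original observations, a minimal sketch along the same lines is shown here. The variable name unemp_rate is hypothetical; the data are the example above.

```stata
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end
* The mean of a 0/1 indicator is a proportion; scaling by 100 gives a percentage.
* Observations with missing unemp are ignored by mean().
bysort educ: egen unemp_rate = mean(100 * unemp)
* Group means and group sizes, without collapsing the data
tabstat unemp, by(educ) statistics(mean n)
list, sepby(educ)
```

Replacing educ in the bysort prefix with another grouping variable (gender or region, say, if such variables exist in the data) should give the corresponding rates in the same way.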

Multiple choices in a choice data set

The original data contains information on the consumerid and the cars they purchased.
clear
input consumerid car purchase
6 American 1
6 Japanese 0
6 European 0
7 American 0
7 Japanese 0
7 European 1
7 Korean 1
end
Since this is purchase data, the dataset needs to be expanded to show the full choice set of cars each time a consumer made a purchase. The final dataset should look like this (screenshot taken from the Stata manual, www.stata.com/manuals/cm.pdf, p. 97, "Example 4: Multiple choices per case"):
I have written some code (shown below) that almost gets me where I need to be, but I have trouble generating a single value of purchase=1 per consumerid-carnumber combination (because of the expansion, the purchase values are duplicated).
egen sumpurchase=total(purchase), by(id)
expand sumpurchase
bysort id car (purchase): gen carnumber=_n
You could use reshape to get all combinations of consumerid/car per car bought. This example assumes that the sort order in the original dataset defines which car is carnumber 1, carnumber 2 etc.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte consumerid str8 car byte purchase
6 "American" 1
6 "Japanese" 0
6 "European" 0
7 "American" 0
7 "Japanese" 0
7 "European" 1
7 "Korean" 1
end
// Generate carnumber
bys consumerid: gen carnumber = cond(purchase != 0, sum(purchase), 0)
// To wide
reshape wide purchase, i(consumerid car) j(carnumber)
// Keep purchased cars only
drop purchase0
// Back to long
reshape long
// Drop if no cars purchased for consumerid/carnumber
bysort consumerid carnumber (purchase) : drop if missing(purchase[1])
// Replace missing with 0 for non-purchased cars
mvencode purchase, mv(0)
// Sort and see results
sort consumerid carnumber car
list, sepby(consumerid carnumber) abbr(14)
Results:
. list, sepby(consumerid carnumber) abbr(14)
     +----------------------------------------------+
     | consumerid        car   carnumber   purchase |
     |----------------------------------------------|
  1. |          6   American           1          1 |
  2. |          6   European           1          0 |
  3. |          6   Japanese           1          0 |
     |----------------------------------------------|
  4. |          7   American           1          0 |
  5. |          7   European           1          1 |
  6. |          7   Japanese           1          0 |
  7. |          7     Korean           1          0 |
     |----------------------------------------------|
  8. |          7   American           2          0 |
  9. |          7   European           2          0 |
 10. |          7   Japanese           2          0 |
 11. |          7     Korean           2          1 |
     +----------------------------------------------+

Stata: append by ID and time stamp

I have two datasets. One dataset (screenshot 1) contains information on product assortment at the grocery store/day level. This data reflects all the products that were available at a store on a given day.
Another dataset (screenshot 2) contains data on individuals who visited those stores on a given day.
As you can see in screenshot 2 the same person (highlighted, panid=1101758) only bought 2 products: Michelob and Sam Adams in week 1677 2 at store 234140, whereas we know that overall 4 options were available to that individual in that store on that same day, i.e. 2 additional Budweisers (screenshot 1, highlighted obs.)
I need to merge/append these two datasets at the store/day for each individual in a way that the final data set shows that a person made those two purchases and in addition there were two more that were available to that individual at that store/day. Thus, that specific individual will have 4 observations - 2 purchased and 2 more available options. I have various stores, days, and individuals.
input store day brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
input hh store day brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
In the Stata code above you can see that it was another individual who purchased 2 Budweisers. For that individual a similar action has to also take place, where it can be shown that the individual had 4 options to choose from (Michelob, Sam Adams, Budweiser, Budweiser) but they ended up choosing only 2 Budweisers.
Here is the example of the end result I would like to receive:
input hh store day brand choice
1 1 1 "Michelob" 1
1 1 1 "Sam Adams" 1
1 1 1 "Bud" 0
1 1 1 "Bud" 0
1 1 1 "Coors" 0
2 1 1 "Bud" 1
2 1 1 "Bud" 1
2 1 1 "Michelob" 0
2 1 1 "Sam Adams" 0
2 1 1 "Coors" 0
3 1 1 "Coors" 1
3 1 1 "Michelob" 0
3 1 1 "Sam Adams" 0
3 1 1 "Bud" 0
3 1 1 "Bud" 0
Here's one way to do it. It involves creating an indicator for repeated products within store and day, using joinby to create all possible combinations between hh and products by store and day, and finally a merge to get the choice variable.
// Import hh data
clear
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
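// Number duplicate products within hh, store and day for merging
// (hh is included so that two households buying the same brand get separate counters)
bysort hh store day brand: gen n_brand = _n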
gen choice = 1
tempfile hh hh_join
save `hh'
// Create dataset for use with joinby to create all possible combinations
// of hh and products per day/store
drop brand n_brand choice
duplicates drop
save `hh_join'
// Import store data
clear
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
// Create all possible combinations of hh and products per day/store
joinby store day using `hh_join'
order hh store day brand n_brand
sort hh store day brand n_brand
// Merge with hh data to get choice variable
merge 1:1 hh store day brand n_brand using `hh'
drop _merge
// Replace choice with 0 if missing
replace choice = 0 if missing(choice)
list, noobs sepby(hh)
And the result:
. list, noobs sepby(hh)
  +-------------------------------------------------+
  | hh   store   day       brand   n_brand   choice |
  |-------------------------------------------------|
  |  1       1     1         Bud         1        0 |
  |  1       1     1         Bud         2        0 |
  |  1       1     1       Coors         1        0 |
  |  1       1     1    Michelob         1        1 |
  |  1       1     1   Sam Adams         1        1 |
  |-------------------------------------------------|
  |  2       1     1         Bud         1        1 |
  |  2       1     1         Bud         2        1 |
  |  2       1     1       Coors         1        0 |
  |  2       1     1    Michelob         1        0 |
  |  2       1     1   Sam Adams         1        0 |
  |-------------------------------------------------|
  |  3       1     1         Bud         1        0 |
  |  3       1     1         Bud         2        0 |
  |  3       1     1       Coors         1        1 |
  |  3       1     1    Michelob         1        0 |
  |  3       1     1   Sam Adams         1        0 |
  +-------------------------------------------------+
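The joinby step is the part that builds the choice sets, so it may help to see it in isolation. The following is a minimal sketch with made-up data (two households, two products at one store/day); the tempfile name households is hypothetical.

```stata
clear
input hh store day
1 1 1
2 1 1
end
tempfile households
save `households'

clear
input store day str9 brand
1 1 "Bud"
1 1 "Coors"
end
* joinby pairs every product row with every household row
* that shares the same store and day
joinby store day using `households'
sort hh brand
list, noobs sepby(hh)
```

With 2 households and 2 products in the same store/day group, the result has 2 x 2 = 4 observations, one per household/product pair.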

Which function I can use in Stata to replicate a quantitative variable?

I'm using a sample survey of persons in a country. Every person has an ID that identifies the household to which he or she belongs. I'm fitting a probit model to analyze the effect of the household head's education on poverty, but I need to replicate the level of education of the head of household to all the members of the household.
How can I create a variable in Stata that replicates the level of education of the head of household to all the members of the household, if they share the same household ID?
I need to do something like the image. I need "schooling of the head of household" variable.
Your data example is helpful, but still ambiguous as the column headers are not all legal Stata variable names and it is not clear whether variables are string or numeric with value labels or numeric. See the Stata tag wiki for detailed advice on data examples.
This example works in terms of numeric variables.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))
list, sepby(id)
     +-----------------------------------+
     | id   relati~p   school~g   wanted |
     |-----------------------------------|
  1. |  1          1          4        4 |
  2. |  1          2          4        4 |
  3. |  1          3          2        4 |
     |-----------------------------------|
  4. |  2          1          5        5 |
  5. |  2          2          4        5 |
     |-----------------------------------|
  6. |  3          1          5        5 |
  7. |  3          3          1        5 |
     +-----------------------------------+
If there is at most one person who is head of household, some other functions of the egen command would work to give the same result, including min(), max() and total(). If two or more people were recorded as head of household, then the mean would indeed be recorded and it might not be an integer.
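The point about min(), max() and total() can be checked directly on the example data. This is only a sketch; the variable names n_heads, wanted_max and wanted_total are hypothetical.

```stata
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
* Count heads per household to check the "at most one head" assumption
bysort id : egen n_heads = total(relationship == 1)
assert n_heads == 1
* With exactly one head per household, max() and total() agree with mean()
bysort id : egen wanted_max = max(cond(relationship == 1, schooling, .))
bysort id : egen wanted_total = total(cond(relationship == 1, schooling, .))
assert wanted_max == wanted_total
list, sepby(id)
```

One difference to keep in mind: for a household with no recorded head, total() returns 0 by default, whereas mean(), min() and max() return missing, so the two families of functions agree only when every household has a head.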
For explanation and discussion, see Section 9 of this paper.

Stata: how to duplicate observations under certain conditions

Please help me duplicate a variable under certain conditions. My original dataset looks like this:
week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
This table says that for each week, there is a unique average price for each category of goods.
I need to create the following variables:
averageprice1 (av. price for category 1)
averageprice2 (av. price for category 2)
such that:
week category averageprice1 averageprice2
1 1 5 6
1 2 5 6
2 1 4 7
2 2 4 7
meaning that for week 1, the average price for category 1 stayed at $5 and the average price for category 2 stayed at $6. Similar logic applies to week 2.
As you can see, the new variables take the same values within each week.
I am still learning Stata. I tried:
bysort week: replace averageprice1=averageprice if categ==1
but it doesn't work as expected.
You are not duplicating observations (in the Stata sense, i.e. cases or records) at all, as (1) the number of observations remains the same and (2) you are copying certain values, not the contents of observations. A similar comment applies to "duplicating variables". However, that's just loose use of terminology.
Taking your example very literally:
clear
input week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
end
bysort week (category) : gen averageprice1 = averageprice[1]
by week: gen averageprice2 = averageprice[2]
l
     +--------------------------------------------------+
     | week   category   averag~e   averag~1   averag~2 |
     |--------------------------------------------------|
  1. |    1          1          5          5          6 |
  2. |    1          2          6          5          6 |
  3. |    2          1          4          4          7 |
  4. |    2          2          7          4          7 |
     +--------------------------------------------------+
This is a standard application of subscripting with by:. Your code didn't work because it did not oblige Stata to look in other observations when that is needed. In fact your use of bysort week did not affect how the code applied at all.
EDIT:
A generalization is
egen averageprice1 = mean(averageprice / (category == 1)), by(week)
egen averageprice2 = mean(averageprice / (category == 2)), by(week)
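With more than two categories, the same egen call can be put in a loop rather than written out once per category. The following is a sketch under the assumption that category takes a modest number of integer values; the local macro name cats is arbitrary.

```stata
clear
input week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
end
* Collect the distinct category values in a local macro
levelsof category, local(cats)
* One new variable per category: dividing by a false condition yields
* missing, which mean() ignores
foreach c of local cats {
    egen averageprice`c' = mean(averageprice / (category == `c')), by(week)
}
list, sepby(week)
```

Unlike the subscripting version, this does not depend on each week having the categories in a fixed order or on every category being present in every week.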