Stata: how to duplicate observations under certain conditions

Please help me duplicate a variable under certain conditions. My original dataset looks like this:
week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
This table says that for each week, there is a unique average price for each category of goods.
I need to create the following variables:
averageprice1 (av. price for category 1)
averageprice2 (av. price for category 2)
such that:
week category averageprice1 averageprice2
1 1 5 6
1 2 5 6
2 1 4 7
2 2 4 7
That is, for week 1 the average price for category 1 stays at 5 and the average price for category 2 stays at 6. The same logic applies to week 2.
As you can see, the values of the new variables are repeated within each week.
I am still learning Stata. I tried:
bysort week: replace averageprice1=averageprice if categ==1
but it doesn't work as expected.

You are not duplicating observations (in the Stata sense, i.e. cases or records) at all, as (1) the number of observations remains the same and (2) you are copying certain values, not the contents of observations. A similar comment applies to "duplicating variables". However, that's just loose use of terminology.
Taking your example very literally:
clear
input week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
end
bysort week (category) : gen averageprice1 = averageprice[1]
by week: gen averageprice2 = averageprice[2]
l
     +--------------------------------------------------+
     | week   category   averag~e   averag~1   averag~2 |
     |--------------------------------------------------|
  1. |    1          1          5          5          6 |
  2. |    1          2          6          5          6 |
  3. |    2          1          4          4          7 |
  4. |    2          2          7          4          7 |
     +--------------------------------------------------+
This is a standard application of subscripting with by:. Your code didn't work because it did not oblige Stata to look in other observations when that is needed. In fact your use of bysort week did not affect how the code applied at all.
EDIT:
A generalization is
egen averageprice1 = mean(averageprice / (category == 1)), by(week)
egen averageprice2 = mean(averageprice / (category == 2)), by(week)
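The division trick above works because (category == 1) evaluates to 1 or 0, and dividing by 0 yields missing in Stata, which mean() ignores. As a minimal sketch of the same idea, the condition can be spelled out with cond(), which some readers may find easier to follow. The data are the question's example; nothing else is assumed.

```stata
clear
input week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
end

* cond() returns averageprice for the chosen category and missing otherwise;
* mean() ignores missings, so every observation in a week gets that price
egen averageprice1 = mean(cond(category == 1, averageprice, .)), by(week)
egen averageprice2 = mean(cond(category == 2, averageprice, .)), by(week)

list, sepby(week)
```

Unlike the subscripting solution, neither form depends on the sort order within week or on every category being present in every week.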

Related

Impute missing categories

I have randomly missing categories in a Stata dataset that look like the following
omb_control_number  agency   hours
1                   HHS-ACF
1                            10
2
2
2                   HHS-CDC  2
3
3                   HHS-ACF  3
3
4                   HHS-ACF  10
4
4
4
The omb_control_number variable is never missing. I am trying to impute the categories so that all observations with the same omb_control_number have the same agency and hours. I tried the following:
by omb_control_number, sort : replace agency = agency[_n-1] if missing(agency)
But it filled in only previous values. Is there a way to do this where it won't just fill in previous values? For reference, the final dataset should look like the following:
omb_control_number agency hours
1 HHS-ACF 10
1 HHS-ACF 10
2 HHS-CDC 2
2 HHS-CDC 2
2 HHS-CDC 2
3 HHS-ACF 3
3 HHS-ACF 3
3 HHS-ACF 3
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10

If you do not care about maintaining original sort order, then you can do this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte omb_control_number str7 agency byte hours
1 "HHS-ACF" .
1 "" 10
2 "" .
2 "" .
2 "HHS-CDC" 2
3 "" .
3 "HHS-ACF" 3
3 "" .
4 "HHS-ACF" 10
4 "" .
4 "" .
4 "" .
end
gsort omb_control_number -agency
bys omb_control_number : replace agency = agency[_n-1] if missing(agency)
sort omb_control_number hours
bys omb_control_number : replace hours = hours[_n-1] if missing(hours)

If agency is a string variable, then
bysort omb (agency) : replace agency = agency[_N]
will copy the last value after sorting to all observations for the same group.
If agency is a numeric variable with value labels, keep reading.
As hours is presumably a numeric variable, it is the same idea with a twist:
bysort omb (hours) : replace hours = hours[1]
In neither case is there any check for two or more non-missing values for the same identifier.
For a numeric variable, whether with or without value labels, a check would be
bysort omb (hours) : gen byte OK = (hours == hours[1]) | missing(hours)
You would then look for any observations with OK equal to 0; 1 means "OK".
String variables can be checked in the same way, except that you need to compare with the last observation (indexed by _N) rather than the first (indexed by 1).
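The string check just described might look like the sketch below. The variable name OK_agency is made up for illustration, and the second non-missing agency in group 2 is a hypothetical conflicting value added so that the check has something to flag.

```stata
clear
input byte omb_control_number str7 agency
1 "HHS-ACF"
1 ""
2 ""
2 "HHS-CDC"
2 "HHS-ACF"
end

* Empty strings sort first, so agency[_N] is a non-missing value
* whenever the group has one; missing() also accepts string arguments
bysort omb_control_number (agency) : gen byte OK_agency = (agency == agency[_N]) | missing(agency)

* Any observation listed here disagrees with the last value in its group
list if !OK_agency
```

Here only the "HHS-ACF" observation in group 2 is flagged, because it differs from "HHS-CDC", the value that sorts last in that group.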

This will get you the desired results:
bysort omb_control_number: gen nonmissing = sum(!missing(agency)) if !missing(agency)
bysort omb_control_number: gen nonmissing2 = sum(!missing(hours)) if !missing(hours)
bysort omb_control_number (nonmissing) : replace agency = agency[1]
bysort omb_control_number (nonmissing2) : replace hours = hours[1]
drop nonmissing*

In Stata, how can I only analyze observations with repeated measures using the mixed command?

I have a dataset on multiple outcomes for individuals in two groups that were treated (or not treated) by an intervention at two time points. However, not every individual has complete data for each measure at each time point.
id outcome outcome_value group time
1 depression 10 1 1
1 depression 8 1 2
2 depression 10 2 1
2 depression . 2 2
1 anxiety 12 1 1
1 anxiety 8 1 2
2 anxiety 12 2 1
2 anxiety 6 2 2
How do I exclude IDs that do not have an outcome in both periods? I only want to see how outcomes changed between groups over time for observations that have data in all periods. I am using the mixed command in Stata to conduct this analysis.

First drop the missing rows
keep if !missing(outcome_value)
Then, keep the ID/outcome combinations that have _N==2
bysort id outcome: keep if _N==2
Output:
id outcome outco~ue group time ct
1 anxiety 8 1 2 2
1 anxiety 12 1 1 2
1 depression 10 1 1 2
1 depression 8 1 2 2
2 anxiety 6 2 2 2
2 anxiety 12 2 1 2
As @NickCox has pointed out in the comments, while we cannot directly combine these two, there is still a one-line approach:
bysort id outcome (time) : keep if !missing(outcome_value[1], outcome_value[2])
Of note, we cannot do this:
bysort id outcome : keep if !missing(outcome_value) & _N==2
because _N is evaluated before any observations are dropped, so within each group it still counts the rows with missing outcome_value.
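Another way around the _N problem, offered here as a sketch rather than as part of the original answer, is to count the non-missing values directly with egen's count() function and keep only the complete groups. The variable name n_obs is made up for illustration; the data are the question's example.

```stata
clear
input byte id str10 outcome byte(outcome_value group time)
1 "depression" 10 1 1
1 "depression"  8 1 2
2 "depression" 10 2 1
2 "depression"  . 2 2
1 "anxiety"    12 1 1
1 "anxiety"     8 1 2
2 "anxiety"    12 2 1
2 "anxiety"     6 2 2
end

* count() counts non-missing outcome_value within each id/outcome pair
bysort id outcome : egen n_obs = count(outcome_value)

* Keep only the pairs observed at both time points
keep if n_obs == 2
drop n_obs

list, sepby(id outcome)
```

This drops both depression rows for id 2 and leaves the other six observations, and it does not depend on the order of the two steps.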

Which function can I use in Stata to replicate a quantitative variable?

I'm using a sample survey of persons in a country. Each person has an ID that identifies the household to which he/she belongs. I'm fitting a probit model to analyze the effect of the household head's education on poverty, but I need to replicate the level of education of the head of household to all the members of the household.
How can I create a variable in Stata that replicates the level of education of the head of household to all the members of the household, if they share the same household ID?
I need to do something like the image. I need "schooling of the head of household" variable.

Your data example is helpful, but still ambiguous as the column headers are not all legal Stata variable names and it is not clear whether variables are string or numeric with value labels or numeric. See the Stata tag wiki for detailed advice on data examples.
This example works in terms of numeric variables.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))
list, sepby(id)
     +-----------------------------------+
     | id   relati~p   school~g   wanted |
     |-----------------------------------|
  1. |  1          1          4        4 |
  2. |  1          2          4        4 |
  3. |  1          3          2        4 |
     |-----------------------------------|
  4. |  2          1          5        5 |
  5. |  2          2          4        5 |
     |-----------------------------------|
  6. |  3          1          5        5 |
  7. |  3          3          1        5 |
     +-----------------------------------+
If there is at most one person who is head of household, some other functions of the egen command would work to give the same result, including min(), max() and total(). If two or more people were recorded as head of household, then the mean would indeed be recorded and it might not be an integer.
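Since the result depends on how many heads each household has, it may be worth checking that first. The following is a minimal sketch using the same example data; the variable name n_heads is made up for illustration.

```stata
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end

* total() of a true/false expression counts the observations where it is true
bysort id : egen n_heads = total(relationship == 1)

* Households listed here have no head or more than one head recorded
list id relationship schooling if n_heads != 1
```

With this example data nothing is listed, so mean(), min(), max() and total() would all return the same value for wanted.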
For explanation and discussion, see Section 9 of this paper.

Generating income of other members in family

I have a dataset like this.
FamilyID Status personID spouseID HeadID spouse_of_referenceID income
1 Head 1 2 1 2
1 Spouse of head 2 1 1 2
1 Child 3 NA 1 2
2 Head 1 3 1 3
2 Spouse of head 3 1 1 3
For every "child" I want to create a variable "parents' income", which is the sum of the income of the head and the income of the spouse of the head.
I am thinking of something like
bysort family: egen parentsincome = if ??? status==4
because status is 4 if the person is a child.
But I am not sure how to proceed. I thought about using _n but I couldn't think of a real solution.

That is a weak data example: 7 variables are declared, but only 6 are exemplified, and it is not given as Stata code. "NA" isn't a Stata code for missing. Some engineering was needed to make sense of it. Statalist has advice on preparing data examples that applies here too.
You can just get the totals conditional on a person being head or their spouse directly with egen.
clear
input FamilyID str14 Status personID spouseID HeadID spouse_of_referenceID income
1 "Head" 1 2 1 2 1000
1 "Spouse of head" 2 1 1 2 2000
1 "Child" 3 . 1 2 0
2 "Head" 1 3 1 3 3000
2 "Spouse of head" 3 1 1 3 4000
end
egen HSIncome = total(income / inlist(Status, "Head", "Spouse of head")), by(FamilyID )
list FamilyID Status personID income HSIncome, sepby(FamilyID)
     +----------------------------------------------------------+
     | FamilyID           Status   personID   income   HSIncome |
     |----------------------------------------------------------|
  1. |        1             Head          1     1000       3000 |
  2. |        1   Spouse of head          2     2000       3000 |
  3. |        1            Child          3        0       3000 |
     |----------------------------------------------------------|
  4. |        2             Head          1     3000       7000 |
  5. |        2   Spouse of head          3     4000       7000 |
     +----------------------------------------------------------+
See e.g. this paper Sections 9 and 10 for a review of technique.
If you're using value labels instead to show status, the code will naturally be different.
The help for egen is explicit that you shouldn't try to use _n in conjunction with egen. This is because egen often sorts the data temporarily, so observations may change their order in the dataset.
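For the value-label case mentioned above, a sketch might look like the following. It assumes a numeric variable named status with codes 1 for head and 2 for spouse of head; these two codes and the label name statuslbl are hypothetical, and only the code 4 for a child comes from the question.

```stata
clear
input FamilyID status income
1 1 1000
1 2 2000
1 4 0
2 1 3000
2 2 4000
end

* Hypothetical coding: 1 = head, 2 = spouse of head; 4 = child (from the question)
label define statuslbl 1 "Head" 2 "Spouse of head" 4 "Child"
label values status statuslbl

* Compare on the underlying numeric codes, not on the label text
egen HSIncome = total(cond(inlist(status, 1, 2), income, .)), by(FamilyID)

list, sepby(FamilyID)
```

The comparison uses the numeric codes because value labels only affect how the variable is displayed, not what it contains.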

Lag in Stata generates only missing

I am having trouble using the L1 operator in Stata 14 to create lagged variables.
The resulting lag variable is 100% missing values!
gen d = L1.equity
Thanks in advance

There is hardly enough information given in the question to know for certain, but as @Dimitriy V. Masterov suggested by asking how your data is tsset, you likely have an issue there.
As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
Now, if you improperly tsset this data, you can easily generate the missing values you describe:
tsset year id
gen lag_gdp = L1.gdp
And notice now how you have 10 missing values generated. In this example, it happens because the panel and time variables are out of order and the (incorrectly specified) time variable has gaps (period 1 and period 3, but no period 2).
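For contrast, here is a sketch of the same example with the panel variable first and the time variable second, which is the order tsset expects for panel data.

```stata
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end

* Panel variable first, time variable second
tsset id year

* The lag is now taken within each country, one year back
gen lag_gdp = L1.gdp

list, sepby(id)
```

Here lag_gdp is missing only for year 1 of each country, where there is no earlier year to take a value from.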
Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:
clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end
tsset year gdp
gen d = L1.gdp
I suspect you are having a similar issue.
Without knowing what your data looks like or how it is tsset there is no possible way to diagnose this, but it is very likely an issue with how the data is tsset.
