I am working in Stata with a dataset in which farmer households are repeated when they have multiple plots. For example, a household may have 3 plots on which it grows paddy.
Now I would like to find the minimum quantity of paddy among all the plots for a given household and then drop that row.
How do I do this?
Example:
HHID Plot Qty
1 1 1
1 2 3
2 1 0.5
2 2 1
I want to drop Qty 1 and 0.5 for households 1 and 2 respectively,
so my table will be
HHID Plot Qty
1 2 3
2 2 1
bysort HHID (Qty) : drop if _n == 1
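To see why this works: bysort HHID (Qty) sorts the plots by Qty within each household, so _n == 1 is the plot with the smallest quantity. Here is a minimal sketch on the example data. The _N > 1 guard is my own addition, assuming you would rather keep households that have only one plot; drop it if that is not what you want.

```stata
clear
input HHID Plot Qty
1 1 1
1 2 3
2 1 0.5
2 2 1
end

* Sort plots by Qty within each household, then drop the first (smallest).
* The _N > 1 condition is an added assumption: it keeps single-plot households.
bysort HHID (Qty) : drop if _n == 1 & _N > 1

list, clean noobs
```

Note that if two plots tie for the minimum, only one of them is dropped.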
I have randomly missing categories in a Stata dataset that look like the following
omb_control_number  agency   hours
1                   HHS-ACF
1                            10
2
2
2                   HHS-CDC  2
3
3                   HHS-ACF  3
3
4                   HHS-ACF  10
4
4
4
The omb_control_number variable is never missing anywhere in the data. I am trying to impute the categories so that all observations with the same omb_control_number have the same agency and hours. I tried using the following:
by omb_control_number, sort : replace agency = agency[_n-1] if missing(agency)
But it filled in only previous values. Is there a way to do this where it won't just fill in previous values? For reference, the final dataset should look like the following:
omb_control_number agency hours
1 HHS-ACF 10
1 HHS-ACF 10
2 HHS-CDC 2
2 HHS-CDC 2
2 HHS-CDC 2
3 HHS-ACF 3
3 HHS-ACF 3
3 HHS-ACF 3
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10
4 HHS-ACF 10
If you do not care about maintaining original sort order, then you can do this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte omb_control_number str7 agency byte hours
1 "HHS-ACF" .
1 "" 10
2 "" .
2 "" .
2 "HHS-CDC" 2
3 "" .
3 "HHS-ACF" 3
3 "" .
4 "HHS-ACF" 10
4 "" .
4 "" .
4 "" .
end
gsort omb_control_number -agency
bys omb_control_number : replace agency = agency[_n-1] if missing(agency)
sort omb_control_number hours
bys omb_control_number : replace hours = hours[_n-1] if missing(hours)
If agency is a string variable, then
bysort omb (agency) : replace agency = agency[_N]
will copy the last value after sorting to all observations for the same group.
If agency is a numeric variable with value labels, keep reading.
As hours is presumably a numeric variable, it is the same idea with a twist:
bysort omb (hours) : replace hours = hours[1]
In neither case is there any check for two or more non-missing values for the same identifier.
For a numeric variable, whether with or without value labels, a check would be
bysort omb (hours) : gen byte OK = (hours == hours[1]) | missing(hours)
You would then want to check whether any observations have OK equal to 0. 1 means "OK".
String variables can be checked in the same way, except that you need to compare with the last observation (indexed by _N) rather than the first (indexed by 1).
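For the string case, the check might look like the sketch below. It assumes agency is a string variable, uses the example data from the question, and the name OK_agency is just an illustrative choice.

```stata
clear
input byte omb_control_number str7 agency byte hours
1 "HHS-ACF" .
1 "" 10
2 "" .
2 "" .
2 "HHS-CDC" 2
3 "" .
3 "HHS-ACF" 3
3 "" .
4 "HHS-ACF" 10
4 "" .
4 "" .
4 "" .
end

* Empty strings sort first, so any non-missing value ends up in the
* last observation of each group, indexed by _N.
bysort omb_control_number (agency) : gen byte OK_agency = (agency == agency[_N]) | missing(agency)

* Show any identifiers with conflicting non-missing values (none in this example)
list omb_control_number agency if !OK_agency
```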
This will get you the desired results:
bysort omb_control_number: gen nonmissing = sum(!missing(agency)) if !missing(agency)
bysort omb_control_number: gen nonmissing2 = sum(!missing(hours)) if !missing(hours)
bysort omb_control_number (nonmissing) : replace agency = agency[1]
bysort omb_control_number (nonmissing2) : replace hours = hours[1]
drop nonmissing*
I have a dataset on multiple outcomes for individuals in two groups that were treated (or not treated) by an intervention at two time points. However, not every individual has complete data for each measure at each time point.
id  outcome     outcome_value  group  time
1   depression  10             1      1
1   depression  8              1      2
2   depression  10             2      1
2   depression  .              2      2
1   anxiety     12             1      1
1   anxiety     8              1      2
2   anxiety     12             2      1
2   anxiety     6              2      2
How do I exclude IDs that do not have an outcome in both periods? I only want to see how outcomes changed between groups over time for observations that have data in all periods. I am using the mixed command in Stata to conduct this analysis.
First drop the missing rows
keep if !missing(outcome_value)
Then, keep the ID/outcome combinations that have _N==2
bysort id outcome: keep if _N==2
Output:
id outcome outco~ue group time ct
1 anxiety 8 1 2 2
1 anxiety 12 1 1 2
1 depression 10 1 1 2
1 depression 8 1 2 2
2 anxiety 6 2 2 2
2 anxiety 12 2 1 2
As @NickCox has pointed out in the comments, while we cannot directly combine these two commands, there is still a one-line approach:
bysort id outcome (time) : keep if !missing(outcome_value[1], outcome_value[2])
Of note, we cannot do this:
bysort id outcome : keep if !missing(outcome_value) & _N==2
because _N is evaluated on the data as they stand when the command runs: the rows with missing outcome are still counted, so _N==2 also holds for groups that have only one non-missing value.
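Another way around the _N issue is to count the non-missing values directly. This is only a sketch under the assumption that every id/outcome pair should have exactly two periods; the variable name n_obs is made up for illustration.

```stata
clear
input id str10 outcome outcome_value group time
1 "depression" 10 1 1
1 "depression" 8 1 2
2 "depression" 10 2 1
2 "depression" . 2 2
1 "anxiety" 12 1 1
1 "anxiety" 8 1 2
2 "anxiety" 12 2 1
2 "anxiety" 6 2 2
end

* count() counts non-missing values of outcome_value within each id/outcome pair
egen n_obs = count(outcome_value), by(id outcome)

* Keep only the pairs observed in both periods
keep if n_obs == 2
drop n_obs
```

Since count() ignores missing values, there is no need to drop the missing rows first.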
Hello this is my data sample
coustmer_NO id
1 5
1 13
2 4
2 4
2 4
3 4
3 10
4 8
4 8
Using SQL, I would like to count, for each customer, how many different ids they have.
The expected output is:
coustmer_NO total_id
1 2
2 1
3 2
4 1
I guess there is a typo in your data.
The result should be:
coustmer_NO total_id
1 2
2 1
3 2
4 1
You can do the following:
SELECT coustmer_NO, count(distinct id) AS total_id FROM <table_name> GROUP BY coustmer_NO;
Try this query in MySQL:
select coustmer_NO, count(distinct id) as 'total_id' from table_name group by coustmer_NO;
data1 is data from 1990 and it looks like
Panelkey Region income
1 9 30
2 1 20
4 2 40
data2 is data from 2000 and it looks like
Panelkey Region income
3 2 40
2 1 30
1 1 20
I want to add a column of where each person lived in 1990.
Panelkey Region income Region1990
3 2 40 .
2 1 30 1
1 1 20 9
How can I do this in Stata?
The following code handles panel members who live in multiple regions in the same year by choosing the region with the larger income. This would make sense if income were proportional to the fraction of the year spent in a region. Ties in income are broken by taking the highest region value. Other types of aggregation might make sense (take a look at the -collapse- command).
Note that I tweaked your data by inserting a second row for the last observation in each year:
clear
input Panelkey Region income
1 9 30
2 1 20
4 2 40
4 10 80
end
rename (Region income) =1990
bysort Panelkey (income Region): keep if _n==_N
isid Panelkey
save "data1990.dta", replace
clear
input Panelkey Region income
3 2 40
2 1 30
1 1 20
1 9 20
end
bysort Panelkey (income Region): keep if _n==_N
isid Panelkey
merge 1:1 Panelkey using "data1990.dta", keep(match master) nogen
list, clean noobs
Please help me duplicate a variable under certain conditions? My original dataset looks like this:
week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
This table says that for each week, there is a unique average price for each category of goods.
I need to create the following variables:
averageprice1 (av. price for category 1)
averageprice2 (av. price for category 2)
such that:
week category averageprice1 averageprice2
1 1 5 6
1 2 5 6
2 1 4 7
2 2 4 7
meaning that for week 1, the average price for category 1 stayed at $5, and the average price for category 2 stayed at $6. Similar logic applies to week 2.
As you can see, the new variables are repeated within each week.
I am still learning Stata. I tried:
bysort week: replace averageprice1=averageprice if categ==1
but it doesn't work as expected.
You are not duplicating observations (in the Stata sense, i.e. cases or records) at all, as (1) the number of observations remains the same and (2) you are copying certain values, not the contents of observations. A similar comment applies to "duplicating variables". However, that's just loose use of terminology.
Taking your example very literally
clear
input week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
end
bysort week (category) : gen averageprice1 = averageprice[1]
by week: gen averageprice2 = averageprice[2]
l
     +--------------------------------------------------+
     | week   category   averag~e   averag~1   averag~2 |
     |--------------------------------------------------|
  1. |    1          1          5          5          6 |
  2. |    1          2          6          5          6 |
  3. |    2          1          4          4          7 |
  4. |    2          2          7          4          7 |
     +--------------------------------------------------+
This is a standard application of subscripting with by:. Your code didn't work because it did not oblige Stata to look in other observations when that is needed. In fact your use of bysort week did not affect how the code applied at all.
EDIT:
A generalization is
egen averageprice1 = mean(averageprice / (category == 1)), by(week)
egen averageprice2 = mean(averageprice / (category == 2)), by(week)
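The trick here is that (category == 1) evaluates to 1 or 0, and dividing by 0 yields missing, which mean() ignores. So only the prices for the category in question enter the mean. With more categories, the same idea could be put in a loop. This is just a sketch on the example data, assuming category holds small integer codes so that the new variable names are legal.

```stata
clear
input week category averageprice
1 1 5
1 2 6
2 1 4
2 2 7
end

* Collect the distinct category codes into a local macro
levelsof category, local(cats)

* For each code, average only that category's prices within each week:
* dividing by (category == `c') turns all other rows into missing
foreach c of local cats {
    egen averageprice`c' = mean(averageprice / (category == `c')), by(week)
}

list, sepby(week)
```

Unlike subscripting with [1] and [2], this does not depend on every week having every category present in the same order.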