How to replace a zero-valued answer by its respective average value? - stata

I have a household data set which includes expenditures for various foods. I categorized them into main food groups and price is obtained by dividing the expenditure value by quantity. For some households price comes as zero since their consumption with respect to the corresponding food group is zero. In such cases, I want to get the price as the average price of the corresponding city, district & province, which that non-consumed household is selected.
How could I do it using STATA?

The mean of the positive values is
egen mean_price = mean(price / (price > 0)), by(province district city)
and you can replace zeros in a clone by
gen price2 = cond(price > 0, price, mean_price)
The division trick can be explained like this. If price > 0 is true, then that expression evaluates to 1; and if false to 0. Dividing by 1 clearly leaves values unchanged. Dividing by 0 creates missings, which egen's mean() function will ignore, which is precisely what is wanted.
There is more discussion of related technique in the article referred to in http://www.stata-journal.com/article.html?article=dm0055
P.S. Stata is the correct spelling. It is an invented word, and was never an acronym.
P.S. You have yet to acknowledge an answer at How to get the difference of two variables, when there are missing values?
LATER:
In this case another way is
egen total = total(price), by(province district city)
egen number = total(price > 0), by(province district city)
gen price2 = cond(price > 0, price, total/number)
as zero prices make no difference to the total. Use doubles throughout.

Related

Finding string value associated with max value of record subset in long format

For non-longitudinal analysis using long-formatted data, when subjects have multiple visits or records, I will typically hunt down a record within each subject using bysort ID, and set a temporary variable to hold the integer or real value that I found, and then egen max() to find the max value for all records found, then set a final value in record _n==1 for that subject. This is so I can have the values I want from different visits percolate to a single record for each subject. Each single record per subject will then be used during analysis (but not longitudinal, maybe cross-sectional or regression, ANOVA, etc.)
Let's say I want the highest cholesterol (ldl) value for the 3rd year of a trial, where ldl is measured quarterly (every 3 months) for all subjects, which can be accomplished using the code below:
cap drop ldl3tmp
cap drop ldl3max
cap drop ldl3
bysort id (visitdate): gen ldl3tmp = ldl if trialyear==3
bysort id (visitdate): egen ldl3max = max(ldl3tmp)
bysort id (visitdate): gen ldl3 = ldl3max if _n==1
Suppose there are initials for the lab technician or phlebotomist that did the blood draw. How can I percolate a string value to record _n==1 that's associated with the greatest ldl value among the subset of records for the 3rd year of the trial? String values can't be sorted, so I am guessing the answer might be to eliminate records for which ldl is not the greatest value in year 3, then the string will be in that record?
In this case, how can I find out what _n is for the maximum value? If I know that, I could use
bysort id (visitdate): drop if _n!=6 //if _n==6 has the max value of ldl
Here is how to find the record number associated with the greatest ldl value within 4 quarterly ldl values in year 3 of a trial. The result is a variable called recmax, which will only be filled in for the specific record where the greatest value was found (among all records for each subject).
cap drop tmpldl3
cap drop maxldl3
cap drop recmax
cap drop visitdate
gen long visitdate = date(dateofvisit, "MDY") //You have to convert date ("MM/DD/YYYY") to a long integer format - based on #days since Jan 1, 1960
bysort id (visitdate): gen tmpldl3 = ldl if trialyear ==3
bysort id (visitdate): egen maxldl3 = max(tmpldl3)
bysort id (visitdate): gen recmax = _n if tmpldl3==maxldl3 & tmpldl3!=. & maxldl3!=.
You can then analyze all the other data (such as string data) in that record cross-sectionally (ANOVA, correlation, regression) by specifying if recmax!=. in the trailing if statement for any analysis command. If you are careful, you could also drop all other records with extraneous ldl values not of interest by using the command drop if recmax!=. providing you realize you dropped data and if you save, save to a filename with "_reduced" or "_dropped" in it.

Reordering panels by another variable in twoway, by() graphs

Suppose I make the following chart showing the weight of 9 pigs over time:
webuse pig
tw line weight week if inrange(id,1,9), by(id) subtitle(, nospan)
Is it possible to reorder the panels by another variable while retaining the original label? I can imagine defining another variable that is sorted the right way and then labeling it with the right id, but curious if there is a less clunky way of achieving that.
I think you are right: you need a new ordering variable. Positively, you can order on any criterion of choice. Watch out for ties on the variable used to order, which can always broken by referring to the original identifier. Here we sort on final weights, by default smallest first. (For largest first, negate the weight variable.)
webuse pig, clear
keep if id <= 9
bysort id (week) : gen last = weight[_N]
egen newid = group(last id)
bysort newid : gen toshow = strofreal(id) + " (" + strofreal(last, "%2.1f") + ")"
* search labmask for download links
labmask newid , values(toshow)
set scheme s1color
line weight week, by(newid, note("")) sort xla(1/9)
Short papers discussing the principles here are already in train for publication in the Stata Journal in 2021.

Collapsing sum of variable conditional on another variable

I am working with the CES diary data from 2006. I have a file which for each household has an entry for each item bought during a week long period. I have the following variables
newid id of household
cost dollar cost of item
ucc a code denoting the type of item
I am interested in restaurant expenditures which is covered by ucc 190111, 190112, ... . I want to collapse my data so for each newid I have the sum of restaurant expenditures for the household during the week. I used the command
collapse (sum) cost if ucc=="190111".... , by (newid)
However, I would like to have a zero when there are no restaurant expenditures and Stata simply removes those entries.
You need an intermediate variable with some zeros for non-restaurant expenditures:
gen rest_exp = cond(inlist(ucc,"190111","190112"),cost,0)
collapse (sum) rest_exp, by(newid)
One caveat is that inlist() has a constraint of 9 possible values for strings, but you probably have fewer than that or should destring, in which case the limit is 254. You can also hitch a few inlist()s together with |.

Stata - Generating rolling average variable

I'd like to generate a rolling average variable from a basketball dataset. So if the first observation is 25 points on January 1, the generated variable will show 25. If the second observation is 30 points on January 2, the variable generated will show 27.5. If the third observation is 35 points, the variable generated will show 30, etc.
For variable y ordered by some time t at its simplest the average of values to date is
gen yave = sum(y) / _n
which is the cumulative sum divided by the number of observations. If there are occasional missing values, they are ignored by sum() but the denominator needs to be fixed, say
gen yave = sum(y) / sum(y < .)
This generalises easily to panel structure
bysort id (t) : gen yave = sum(y) / sum(y < .)
Here is the solution I came up with. I had to create three variables, a cumulative point total (numerator) and a running count (denominator), then divided the two variables to get player points per game:
gen player_pts = points if player[_n]!=player[_n-1]
replace player_pts=points+player_pts[_n-1] if player[_n]==player[_n-1]&[_n]!=1
by player: gen player_games= [_n]
gen ppg=player_pts/player_games

Creating indicator variable in Stata

In a panel data set I have 3 variables: name, week, and income.
I would like to make an indicator variable that indicates initial weeks where income is 0. So say a person X has 0 income in the first 13 weeks, the indicator takes the value 1 the first 13 weeks, and is otherwise 0. The same procedure for person Y and so on.
I have tried using by groups, but I can't get it to work.
Any suggestions?
One solution is
bysort name (week) : gen no_income = sum(income) == 0
The function sum() yields cumulative or running sum. So, as long as income is 0, its cumulative sum remains 0 too. As soon as a person earns something, the cumulative sum becomes positive. The code is based on the presumption that cumulative income can not cross zero again because in a given week, income is negative. To exclude that possibility use an appropriate extra condition, such as
bysort name (week) : gen no_income = sum(income) == 0 & income == 0
For a problem with very similar flavour, see this FAQ. A meta-lesson is to look at the StataCorp FAQs as one of several resources.