Getting unknown function mean() in a forvalues loop - Stata

Getting unknown function mean for this. Can't use egen because it has to be calculated for each value. A little confused.
edu_mov_avg=.
forvalues current_year = 2/133 {
local current_mean = mean(higra) if longitbirthqtr >= current_year - 2 & longitbirthqtr >= current_year + 2
replace edu_mov_avg = current_mean if longitbirthqtr =
}

Your code is a long way from working. This should be closer.
gen edu_mov_avg = .
qui forvalues current_qtr = 2/133 {
su higra if inrange(longitbirthqtr, `current_qtr' - 2, `current_qtr' + 2), meanonly
replace edu_mov_avg = r(mean) if longitbirthqtr == `current_qtr'
}
You need to use a command generate to produce a new variable.
You need to reference local macro values with single quotation marks: a left quote before the name and a right quote after it, as in `current_qtr'.
egen has its own mean() function, but it produces a variable, whereas you need a constant here. Using summarize, meanonly is the most efficient method. There is in Stata no mean() function that can be applied anywhere. Once you use summarize, there is no need to use a local macro to hold its results. Here r(mean) can be used directly.
You have >= twice, but presumably don't mean that: the second was presumably meant to be <=. Using inrange() is not essential in writing your condition, but gives shorter code.
You can't use if qualifiers to qualify assignment of local macros in the way you did. They make no sense to Stata, as such macros are constants.
longitbirthqtr looks like a quarterly date. Hence I didn't use the name current_year.
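For comparison, here is a sketch of the same summarize call with the condition written out in full rather than with inrange(), assuming the second inequality was meant to be <=. It would go inside the loop above in place of the inrange() line:
su higra if longitbirthqtr >= `current_qtr' - 2 & longitbirthqtr <= `current_qtr' + 2, meanonly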
With a window this short, there is an alternative using time series operators
tsset longitbirthqtr
gen edu_mov_avg = (L2.higra + L1.higra + higra + F1.higra + F2.higra) / 5
That is not exactly equivalent, as missing values will be returned for the first two observations and the last two.
Your code may need further work if your data are panel data. But the time series operators approach remains easy so long as you declare the panel identifier, e.g.
tsset panelid longitbirthqtr
after which the generate call is the same as above.
All that said, rolling offers a framework for such calculations.

Related

Is there a way to extract year range from wide data?

I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.
Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
noting that in Stata true-or-false expressions evaluate directly to 1 or 0.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).
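A minimal sketch of one way to collect the years actually present into a local macro, assuming every variable matching Car???? ends in a four-digit year. The macro names lower, years and yr are illustrative, not from the question. The unab check stops the do-file if any lowercase car???? variable exists, since unab fails when nothing matches:
capture unab lower : car????
if _rc == 0 {
display as error "unexpected lowercase variables: `lower'"
exit 198
}
local years
foreach v of varlist Car???? {
local yr = substr("`v'", 4, 4)
local years `years' `yr'
}
display "`years'"
The local years could then drive a loop such as foreach y of local years, although looping over the variables directly, as above, is usually simpler.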

Day of the week effect - excluding dummy variables not individually

I want to test the day of the week effect of stock returns. The stata code I have written works, but looks fairly inefficient.
// 1) Monday effect
eststo:reg return day_dummy2 day_dummy3 day_dummy4 day_dummy5
// 2) Tuesday effect
eststo:reg return day_dummy1 day_dummy3 day_dummy4 day_dummy5
// 3) Wednesday effect
eststo:reg return day_dummy1 day_dummy2 day_dummy4 day_dummy5
and so on.
Is there a way to write a code with the same function (excluding one day at a time) with e.g. a foreach loop?
Thank you very much for your help!
A bit clunky, perhaps, but you could use Stata's macro extended functions (see help extended_fcn) to iteratively exclude one of your listed variables and generate the list of remaining variables.
local vars "day1 day2 day3 day4 day5 day6 day7"
forvalues i = 1/7 {
local varexclude : word `i' of `vars'
local varsout`i' : subinstr local vars "`varexclude'" ""
// insert -estout- command here
}
macro list // to verify the individual `varsout`i'' local macros
You can obtain the initial varlist with ds day*, which stores the variable list in r(varlist).
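An alternative sketch uses macro list subtraction, which removes whole words rather than substrings. It assumes the dummies are named day_dummy1 to day_dummy5 as in the question, and that eststo from the community-contributed estout package is installed; the macro name others is illustrative:
ds day_dummy*
local vars `r(varlist)'
foreach v of local vars {
local others : list vars - v
eststo: regress return `others'
}
Each pass through the loop drops one dummy from the list and regresses return on the remainder.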

Generate new variable using min/max in Stata

After many years away from Stata I am currently editing code which repeatedly does something like this:
egen min = min(x)
egen max = max(x)
generate xn = (x - min) / (max - min)
drop min max
I want to reduce this code to one line. But neither of the two "natural" ways that come to my mind work.
gen x_index = (x - min(x)) / (max(x)- min(x))
egen x_index = (x - min(x)) / (max(x)- min(x))
What pieces of the Stata logic am I missing?
The Stata functions max() and min() require two or more arguments and operate rowwise (within each observation, across the arguments) if given a variable as any one of the arguments. Documented at e.g. help max().
The egen functions max() and min() can only be used within egen calls. They could be applied with single variables, but their use to calculate single maxima or minima is grossly inefficient unless exceptionally it is essential to store the single result in a variable. Documented except for the warnings at help egen.
Neither approach you consider will work without becoming more roundabout. Consider
su x, meanonly
gen x_index = (x - r(min)) / (r(max)- r(min))
In some circumstances it might be more efficient to calculate the range just once:
su x, meanonly
scalar range = r(max) - r(min)
gen x_index = (x - r(min)) / range
In a program it would usually be better to give the scalar a temporary name.
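A sketch of that temporary-name variant, using tempname so that the scalar cannot clash with an existing variable or scalar called range:
tempname range
su x, meanonly
scalar `range' = r(max) - r(min)
gen x_index = (x - r(min)) / `range'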
Within egen calls, an egen function can be called only once.

Generate difference between observations as new variable in Stata

I am trying to get the difference between the natural logarithms of two consecutive observations for a set of variables.
My approach is as follows
. gen abandon_qry_ln = ln(abandon_qry) - ln(abandon_qry) [_n-1]
But I get the error weights not allowed.
Any idea what could be the issue?
You could work with
gen difference = ln(abandon_qry) - ln(abandon_qry[_n-1])
or
gen ln_abandon_qry = ln(abandon_qry)
gen difference = ln_abandon_qry - ln_abandon_qry[_n-1]
You were trying to subscript an expression. You may subscript a variable or a matrix in Stata, but not in general an expression.
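Since the question mentions a set of variables, the same idea can be looped. This is a sketch in which other_qry is a hypothetical second variable name and the data are assumed to be already sorted in the intended order:
foreach v of varlist abandon_qry other_qry {
gen dln_`v' = ln(`v') - ln(`v'[_n-1])
}
The first observation of each new variable will be missing, as there is no previous value to subtract.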

How do I take data from one observation and apply it to one other observation within a group?

An unmarried couple is living together in a house with other people. To isolate how much that couple makes I need to add the two incomes together. I am using variables that act as pointers that give the partners_id. Using the partners_id, id , and individual_income how do I apply partner's income to his/her partner?
This was my attempt below:
summarize id, meanonly
capture gen partners_income = 0
forvalue ln = 1/`r(max)' {
bys household (id): ///
egen link_`ln' = total(individual_income) if partners_location==`ln')
replace partners_income = link_`ln' if link_`ln' > 0 & id == `ln'
drop link_*
}
There is general advice in this FAQ.
It can take longer to write a smart way to do this than to use a quick-and-dirty approach.
However, there is a smarter way.
Brute solution
Quick here means relatively quick to code; this isn't guaranteed quick for a very large dataset.
gen partners_income = .
gen problem = 0
The proper initialisation of the partner's income variable is to missing, not zero. Not knowing an income and the income being zero are different conditions. For example, if someone doesn't have a partner, the income will certainly be missing. (If at a later stage, you want to treat missings as zeros, that's up to you, but you should keep them distinct at this stage.)
The reason for the problem variable will become apparent.
I can't see a reason for your capture.
Now we can loop:
quietly forval i = 1/`=_N' {
su individual_income if id == partners_id[`i'], meanonly
replace partners_income = r(max) in `i'
if r(N) > 1 replace problem = r(N) in `i'
}
So, the logic is
foreach observation
find the partner's identifier
find that income: summarize, meanonly is fast
that should be one value, so it should be immaterial whether we pick it up from the results of summarize as the maximum, minimum, or mean
but if summarize finds more than one value, something is not as assumed (mistakes over identifiers, or multiple partners); later we edit if problem and look at those observations.
Notes:
We can make comparison safer by restricting computations to the same household by modifying
if id == partners_id[`i']
to
if id == partners_id[`i'] & household == household[`i']
In one place you have the variable partners_location which looks like a typo for partners_id.
Cute solution
Assuming that partners name each other as partner (and this is not the forum to explore exceptions), then couples have a joint identity which we obtain by sorting "John Joanna" and "Joanna John" to "Joanna John" or the equivalent with numeric identifiers:
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), p(" ")
first and second just mean in numeric or alphanumeric order; this works for numeric and string identifiers. You may need to slap on an exclusion clause such as
if !missing(partners_id)
Now
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2
Get it? Each distinct combination of household and joint should be precisely 2 observations for us to be interested (hence the qualifier if _N == 2). If that's true then 3 - _n gives us the subscript of the other partner as if _n is 1 then 3 - _n is 2 and vice versa. Under by: subscripts are always applied within groups, so that _n runs 1, 2, and so forth in each distinct group.
If this seems cryptic, it is all spelled out in Cox, N.J. 2008. The problem of split identity, or how to group dyads. Stata Journal 8(4): 588-591 which is accessible as a .pdf.
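A small made-up example may help to see the subscript trick at work. The three observations below are invented for illustration, with variable names following the question; person 3 has no partner:
clear
input household id partners_id individual_income
1 1 2 100
1 2 1 200
1 3 . 50
end
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), p(" ")
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2
list
Persons 1 and 2 share the same joint identifier and so should each pick up the other's income, while person 3 forms a group of one and should be left missing.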