How to take maximum of rolling window without SSC packages - stata

How can I create a variable in Stata that contains the maximum of a dynamic rolling window applied to another variable? The rolling window must be able to change iteratively within a loop.
max(numvar, L1.numvar, L2.numvar) will give me what I want for a single window size, but how can I change the window size iteratively within the loop?
My current code for calculating the rolling sum (credit to @Nick Cox for the algorithm):
generate var1lagged = 0
forval k = -2(1)2 {
if `k' < 0 {
local k1 = -(`k')
replace var1lagged = var1lagged + L`k1'.var1
}
else replace var1lagged = var1lagged + F`k'.var1
}
How would I achieve the same sort of flexibility but with the maximum, minimum, or average of the window?

In the simplest case, suppose that a local macro K, at least 1, holds the number of lags in the window:
local arg L1.numvar
forval k = 2/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
If the window includes the present value, that is just a twist
local arg numvar
forval k = 1/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
More generally, numvar would not be a specific variable name, but would be a local macro including such a name.
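As a minimal sketch of that generalisation, the loop below builds the argument list for several window sizes in turn and takes both the maximum and the minimum. The toy data and the names t, x, max1 to max3 and min1 to min3 are assumptions made for illustration only.

```stata
* Toy data; t and x are hypothetical names
clear
set obs 8
gen t = _n
gen x = mod(3 * _n, 7)
tsset t

* The variable name is held in a local macro
local v x

* Window = present value plus K lags, for K = 1, 2, 3
forval K = 1/3 {
    local arg `v'
    forval k = 1/`K' {
        local arg `arg', L`k'.`v'
    }
    gen max`K' = max(`arg')
    gen min`K' = min(`arg')
}
list
```

Both max() and min() ignore missing arguments, so the first observations, where some lags do not exist, still get a result based on whatever values are available.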
EDIT 1
Because max() ignores missing arguments, this returns missing only if all arguments are missing. If you want missing returned whenever any argument is missing, use
gen wanted = cond(missing(`arg'), ., max(`arg'))
EDIT 2
Check out rolling more generally. Otherwise, to calculate a rolling mean directly, you need to work out (1) the sum, as in the question, and (2) the number of non-missing values.
The working context of the OP evidently rules out installing community-contributed commands; otherwise I would recommend rangestat and rangerun (SSC). Note that many community-contributed commands have been published via the Stata Journal, GitHub or user sites.
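The rolling mean described in EDIT 2 could be sketched as follows. This is an illustration under assumptions, not tested on the OP's data: the window is the present value plus K lags, and t, numvar, wsum, wn and wanted_mean are hypothetical names.

```stata
* Toy data with one missing value; all names are hypothetical
clear
set obs 10
gen t = _n
gen numvar = _n^2
replace numvar = . in 4
tsset t

local K 2

* (1) running sum of non-missing values, (2) running count of them
gen double wsum = cond(missing(numvar), 0, numvar)
gen wn = !missing(numvar)
forval k = 1/`K' {
    replace wsum = wsum + cond(missing(L`k'.numvar), 0, L`k'.numvar)
    replace wn = wn + !missing(L`k'.numvar)
}

* The mean is defined only when the window holds at least one value
gen double wanted_mean = wsum / wn if wn > 0
list t numvar wsum wn wanted_mean
```

Treating missings as zero in the sum while counting only non-missing values is what makes the mean use the available values in each window.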

Related

Storing a numerical variable in a local macro?

I have a variable called "count," which contains the number of subjects who attend each of 1300 study visits. I would like to store these values in a local macro and display them one by one using a for loop.
E.g.,
local count_disp = count
forvalues i = 1/1300 {
disp `i' of `count_disp'
}
However, I'm unsure how to store the entire list of the count variable in a macro or how to call each "word" in the macro.
Is this possible in Stata?
In case you only want to display all values in order, then it is easier to skip the intermediate step of creating the macro. You can just display the values row by row like this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte count
11
22
33
end
* Loop over the number of observations and display
* the value of variable count for that row number
* (the count command leaves the number of observations in r(N))
count
forvalues i = 1/`r(N)' {
    * Display value
    display count[`i']
}
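If you do want the values in a local macro, a sketch along these lines should work: build the list one observation at a time, then pull out each "word" with the extended macro functions word count and word # of. The macro names n and this are hypothetical, and a macro holding 1300 values is workable but usually unnecessary.

```stata
clear
input byte count
11
22
33
end

* Append each value of count to the local macro
local count_disp
forvalues i = 1/`=_N' {
    local count_disp `count_disp' `=count[`i']'
}

* Number of words in the macro, then each word in turn
local n : word count `count_disp'
forvalues i = 1/`n' {
    local this : word `i' of `count_disp'
    display "`i' of `n': `this'"
}
```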

Impute missing covariates at random in Stata

I am trying to randomly impute missing data for several covariates using Stata. I have never done this, and I am trying to use this code from a former employee:
local covarall calc_age educcat ipovcat_bl US_born alc_yn2 drug_yn lnlpcbsum tot_iod
local num = 0
foreach j of local covarall {
gen iflag_`j'=0
replace iflag_`j'=1 if `j'==.
local num = `num'+1000
forvalues i = 1/476 {
sort `j'
count if `j'==.
di r(N)
local num2 = `num'+`i'
set seed `num2'
replace `j' in `i'=`j'[1+int((400-r(N))*runiform())] if iflag_`j'[`i']==1
}
}
When I run this, Stata just gives me this over and over forever:
(0 real changes made)
0
0
What am I doing wrong?
The three messages seem interpretable as follows:
replace iflag_`j' = 1 if `j' == .
will lead to a message (0 real changes made) whenever that is so, meaning that the variable in question is never equal to system missing, the requirement for replacement.
count if `j' == .
will lead to the display of 0 in the same circumstance.
di r(N)
ditto. count shows a result by default and then the code insists that it be shown again. Strange style, but not a bug.
All that said the line
replace `j' in `i'=`j'[1+int((400-r(N))*runiform())] if iflag_`j'[`i'] == 1
is quite illegal. My best guess is that you have copied it incorrectly somehow and that it should have been
replace `j' =`j'[1+int((400-r(N))*runiform())] in `i' if iflag_`j'[`i'] == 1
but this too should produce the same message as the first if a value is not missing.
I add that it is utterly pointless to enter the innermost loop if there are no missing values in a variable: there is then nothing to impute.
Changing the seed every time a change is made is strange, but that is partly a matter of taste.
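For comparison, here is a sketch of one way to write a random imputation that skips variables with nothing to impute. It is my own guess at the intent (draw a donor value at random from the observed values of the same variable), not the former employee's method; the toy variables x and y and the seed are made up.

```stata
* Toy data: x has 4 missing values, y has none
clear
set seed 2803
set obs 20
gen x = _n
replace x = . in 5/8
gen y = 100 + _n

foreach j in x y {
    * Flag the values that need imputing
    gen byte iflag_`j' = missing(`j')
    count if iflag_`j'
    local nmiss = r(N)
    * Nothing to impute means nothing to do
    if `nmiss' > 0 {
        * Numeric missings sort last, so donors are observations 1 to ndonor
        sort `j'
        local ndonor = _N - `nmiss'
        replace `j' = `j'[1 + int(`ndonor' * runiform())] if iflag_`j'
    }
}
```

The seed is set once at the top, which is enough for reproducibility. Note that the sort changes the order of the observations.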

How do I take data from one observation and apply it to one other observation within a group?

An unmarried couple is living together in a house with other people. To isolate how much that couple makes, I need to add the two incomes together. I am using variables that act as pointers and give the partners_id. Using partners_id, id, and individual_income, how do I assign each person's income to his or her partner?
This was my attempt below:
summarize id, meanonly
capture gen partners_income = 0
forvalue ln = 1/`r(max)' {
bys household (id): ///
egen link_`ln' = total(individual_income) if partners_location==`ln')
replace partners_income = link_`ln' if link_`ln' > 0 & id == `ln'
drop link_*
}
There is general advice in this FAQ.
It can take longer to write a smart way to do this than to use a quick-and-dirty approach.
However, there is a smarter way.
Brute solution
Quick here means relatively quick to code; this isn't guaranteed quick for a very large dataset.
gen partners_income = .
gen problem = 0
The proper initialisation of the partner's income variable is to missing, not zero. Not knowing an income and the income being zero are different conditions. For example, if someone doesn't have a partner, the income will certainly be missing. (If at a later stage, you want to treat missings as zeros, that's up to you, but you should keep them distinct at this stage.)
The reason for the problem variable will become apparent.
I can't see a reason for your capture.
Now we can loop:
quietly forval i = 1/`=_N' {
su individual_income if id == partners_id[`i'], meanonly
replace partners_income = r(max) in `i'
if r(N) > 1 replace problem = r(N) in `i'
}
So, the logic is
foreach observation
find the partner's identifier
find that income: summarize, meanonly is fast
that should be one value, so it should be immaterial whether we pick it up from the results of summarize as the maximum, minimum, or mean
but if summarize finds more than one value, something is not as assumed (mistakes over identifiers, or multiple partners); later we edit if problem and look at those observations.
Notes:
We can make comparison safer by restricting computations to the same household by modifying
if id == partners_id[`i']
to
if id == partners_id[`i'] & household == household[`i']
In one place you have the variable partners_location which looks like a typo for partners_id.
Cute solution
Assuming that partners name each other as partner (and this is not the forum to explore exceptions), then couples have a joint identity which we obtain by sorting "John Joanna" and "Joanna John" to "Joanna John" or the equivalent with numeric identifiers:
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), p(" ")
first and second just mean in numeric or alphanumeric order; this works for numeric and string identifiers. You may need to slap on an exclusion clause such as
if !missing(partners_id)
Now
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2
Get it? Each distinct combination of household and joint should be precisely 2 observations for us to be interested (hence the qualifier if _N == 2). If that's true then 3 - _n gives us the subscript of the other partner as if _n is 1 then 3 - _n is 2 and vice versa. Under by: subscripts are always applied within groups, so that _n runs 1, 2, and so forth in each distinct group.
If this seems cryptic, it is all spelled out in Cox, N.J. 2008. The problem of split identity, or how to group dyads. Stata Journal 8(4): 588-591 which is accessible as a .pdf.
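A small worked example may help. The toy data below are invented, with one single person (id 3) sharing a household with a couple; the variable couple_income is a hypothetical name for the total the question is after.

```stata
clear
input household id partners_id individual_income
1 1 2 100
1 2 1 250
1 3 . 80
2 4 5 60
2 5 4 40
end

* Joint identifier for each couple, skipping people without a partner
gen first  = cond(id < partners_id, id, partners_id) if !missing(partners_id)
gen second = cond(id < partners_id, partners_id, id) if !missing(partners_id)
egen joint = concat(first second) if !missing(partners_id), punct(" ")

* Within each pair, observation 1 looks at 2 and observation 2 looks at 1
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2 & !missing(partners_id)

gen couple_income = individual_income + partners_income
sort id
list
```

The extra condition !missing(partners_id) guards against two single people in the same household being treated as a pair, because they would share the same empty joint identifier.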

stata - variable operations conditional to existent vars and to a list of varnames

I have this problem.
My dataset has variables like:
sec20_var1 sec22_var1 sec30_var1
sec20_var2 sec22_var2 sec30_var2 sec31_var2
(~102 sectors, ~60 variables; not all of the combinations are complete, or even exist)
My intention is to build an indicator that averages variables within sectors. So it is an "aggregated sector" containing the sectors that belong to a class in a high/medium/low technology classification. I already have the definitions of which sectors belong in each category. Let's say that high technology should include sec20 and sec31.
The problem: the list of sectors belonging to a class and the sectors actually available for each variable don't match. So I'm stuck and started doing it manually. My best approach was:
set more off
foreach v in _var02 {
ds *`v'
di "`r(varlist)'"
local sects`v' `r(varlist)'
foreach s in sec26 sec28 sec37 {
capture confirm local sects`v'
if !_rc {
egen oecd_medhigh_avg_`v'=rowmean(`s'`v' sec28`v' sec37`v' sec40`v' sec59`v' sec92`v' sec54`v' sec55`v' sec48`v' sec50`v' sec53`v' sec4`v' sec5`v' sec6`v')
else {
di "`v' didnt existed"
}
}
}
}
I got it to work only for those variables that have all the sectors present in the rowmean() call (which is simpler, since I don't have to store the varlist in a macro). I would like to average the AVAILABLE sectors, even if there are only two per variable.
I also noticed that storing the varlist in a macro could be helpful, but I don't know how to fit it into my code. I'm totally stuck here.
Thanks for your help! :)
Thank you @SOConnell. As I said in my comment, I went in the same direction, but I'm still searching for the solution I expected (I don't know how to program it, or even whether it's possible).
I used this code, which goes in the same direction as the one by @SOConnell, but I found this one clearer. The trick is the _rc==111 check, which catches the missing sector-by-variable combinations and creates them, so that they can be used in the second part. Everything worked. It's not elegant, but it has some practical use. :) The third part drops the empty variables that were created.
*COMPLETING THE LIST OF COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 _... {
    foreach s in sec27 sec35 sec42 sec43 sec45 sec46 sec39 sec52 sec67 {
        capture confirm variable `s'`v'
        if _rc==111 {
            gen `s'`v'=.
        }
    }
}
*GENERATING THE INDICATOR WITH ALL POSSIBLE COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 ... {
    egen oecd_high_avg_`v'=rowmean(sec27`v' sec35`v' sec42`v' sec43`v' sec45`v' sec46`v' sec39`v' sec52`v' sec67`v')
}
*DROPPING MISSING VARIABLES CREATED TO DO THE INDICATOR.
set more off
foreach v of varlist * {
    gen TEMP=.
    replace TEMP=1 if !missing(`v')
    egen TEMPSUM=sum(TEMP)
    if TEMPSUM==0 {
        di " >>> Dropping empty variable: `v'"
        drop `v'
    }
    drop TEMP TEMPSUM
}
Note that I cut the list of variables short.
I will refer to what you call variables as "accounts".
The workaround would be to create empty variables in the dataset for all sector-by-account combinations. Starting from a point where you already have your dataset loaded into memory:
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
Then apply the rowmean operation to the full definition of each indicator. The missings won't be calculated into your rowmean, so it will effectively be an average of available cells without you having to do the selection manually. You could then probably automate deleting the empty variables you created if you do something like:
g start=.
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
g end=.
[indicator calculations go here]
drop start-end
However, it seems like you would be creating averages that might not be comparable (some will have 2 underlying values, some 3, some 4, etc.) so you need to be careful there (but you are probably already aware of that).
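An alternative that avoids creating and then dropping empty variables is to collect, for each account, only the sector variables that actually exist and pass that list to rowmean(). The following is a sketch on toy data; the sector list in the local macro high and the macro name have are assumptions for illustration, and the new variable is named oecd_high_avg`v', which gives oecd_high_avg_var02 and so on.

```stata
* Toy data: sec27 is missing for both accounts, sec31 for _var03
clear
set obs 3
gen sec20_var02 = _n
gen sec31_var02 = 10 * _n
gen sec20_var03 = 5

* Hypothetical definition of the high-technology class
local high sec20 sec27 sec31

foreach v in _var02 _var03 {
    * Keep only the sector variables that exist for this account
    local have
    foreach s of local high {
        capture confirm variable `s'`v', exact
        if !_rc {
            local have `have' `s'`v'
        }
    }
    if "`have'" != "" {
        egen oecd_high_avg`v' = rowmean(`have')
    }
    else {
        display "no listed sector exists for `v'"
    }
}
list
```

The exact option stops confirm from accepting an abbreviation of a longer variable name. The caveat above about averages built from different numbers of sectors applies here as well.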

Getting unknown function mean() in a forvalues loop

Getting unknown function mean for this. Can't use egen because it has to be calculated for each value. A little confused.
edu_mov_avg=.
forvalues current_year = 2/133 {
local current_mean = mean(higra) if longitbirthqtr >= current_year - 2 & longitbirthqtr >= current_year + 2
replace edu_mov_avg = current_mean if longitbirthqtr =
}
Your code is a long way from working. This should be closer.
gen edu_mov_avg = .
qui forvalues current_qtr = 2/133 {
su higra if inrange(longitbirthqtr, `current_qtr' - 2, `current_qtr' + 2), meanonly
replace edu_mov_avg = r(mean) if longitbirthqtr == `current_qtr'
}
You need to use a command generate to produce a new variable.
You need to reference local macro values with left and right single quotation marks around the name, as in `current_qtr'.
egen has its own mean() function, but it produces a variable, whereas you need a constant here. Using summarize, meanonly is the most efficient method. There is in Stata no mean() function that can be applied anywhere. Once you use summarize, there is no need to use a local macro to hold its results. Here r(mean) can be used directly.
You have >= twice, but presumably don't mean that. Using inrange() is not essential in writing your condition, but gives shorter code.
You can't use if qualifiers to qualify assignment of local macros in the way you did. They make no sense to Stata, as such macros are constants.
longitbirthqtr looks like a quarterly date. Hence I didn't use the name current_year.
With a window this short, there is an alternative using time series operators, assuming one observation per quarter:
tsset longitbirthqtr
gen edu_mov_avg = (L2.higra + L1.higra + higra + F1.higra + F2.higra) / 5
That is not exactly equivalent, as missings will be returned for the first two observations and the last two, and wherever any value in the window is missing.
Your code may need further work if your data are panel data. But the time series operators approach remains easy so long as you declare the panel identifier, e.g.
tsset panelid longitbirthqtr
after which the generate call is the same as above.
All that said, rolling offers a framework for such calculations.
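As a rough sketch of what that could look like, here is rolling applied to toy data with one observation per period; t is a hypothetical time variable and higra is filled with made-up values. Note that rolling reports each window by its start and end, so a window of 5 beginning at start corresponds to a mean centred on start + 2.

```stata
* Toy data; t is a hypothetical time variable
clear
set obs 10
gen t = _n
gen higra = _n
tsset t

* Mean of higra over every window of 5 consecutive periods
* (clear replaces the data in memory with the results)
rolling mean = r(mean), window(5) clear : summarize higra
list
```

The results dataset holds one observation per window, with variables start, end and mean, which would need to be merged back onto the original data if the moving average is wanted alongside it.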