Impute missing covariates at random in Stata

I am trying to randomly impute missing data for several covariates using Stata. I have never done this, and I am trying to use this code from a former employee:
local covarall calc_age educcat ipovcat_bl US_born alc_yn2 drug_yn lnlpcbsum tot_iod
local num = 0
foreach j of local covarall {
gen iflag_`j'=0
replace iflag_`j'=1 if `j'==.
local num = `num'+1000
forvalues i = 1/476 {
sort `j'
count if `j'==.
di r(N)
local num2 = `num'+`i'
set seed `num2'
replace `j' in `i'=`j'[1+int((400-r(N))*runiform())] if iflag_`j'[`i']==1
}
}
When I run this, Stata just gives me this over and over forever:
(0 real changes made)
0
0
What am I doing wrong?

The three messages seem interpretable as follows:
replace iflag_`j' = 1 if `j' == .
will produce the message (0 real changes made) whenever the variable in question is never equal to system missing, which is the requirement for replacement.
count if `j' == .
will lead to the display of 0 in the same circumstance.
di r(N)
Ditto: count shows its result by default, and then the code insists that it be shown again. Strange style, but not a bug.
All that said the line
replace `j' in `i'=`j'[1+int((400-r(N))*runiform())] if iflag_`j'[`i'] == 1
is quite illegal. My best guess is that you have copied it incorrectly somehow and that it should have been
replace `j' =`j'[1+int((400-r(N))*runiform())] in `i' if iflag_`j'[`i'] == 1
but this too should produce the same message as the first if a value is not missing.
I add that it is utterly pointless to enter the innermost loop if there are no missing values in a variable: there is then nothing to impute.
Changing the seed every time a change is made is strange, but that is partly a matter of taste.
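For comparison, here is a minimal sketch of one way to fill each missing value with a random draw from the observed values of the same variable. It is not the former employee's method. The toy variables x and y and the seed are assumptions for illustration, and the seed is set once rather than inside the loop.

```stata
* Toy data (hypothetical): x has 5 missing values, y has 3
clear
set seed 12345
set obs 20
gen x = cond(mod(_n, 4) == 0, ., _n)
gen y = cond(_n <= 3, ., 100 + _n)

foreach j of varlist x y {
    gen byte iflag_`j' = missing(`j')
    quietly count if !iflag_`j'
    local nobs = r(N)
    * skip variables with nothing to impute or nothing to draw from
    if `nobs' > 0 & `nobs' < _N {
        sort iflag_`j'        // observed values now occupy rows 1 to nobs
        quietly replace `j' = `j'[1 + int(`nobs' * runiform())] if iflag_`j'
    }
}
```

Because the flagged rows sort last, the subscript always points at an observed value, and a variable with no missing values never enters the replacement step.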

Related

Is there a way to extract year range from wide data?

I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.
Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
noting the fact in Stata that true or false expressions evaluate to 1 or 0 directly.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).
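As a sketch of how the two remaining wishes might be handled (stopping on lowercase variants, and collecting the years present in a local), something like the following could work. The toy data, the local names lower and years, and the choice of exit code are assumptions for illustration. capture is used only for the existence check, not around the generate call.

```stata
* Toy data standing in for one of the wide datasets (hypothetical values)
clear
set obs 3
gen Car2005 = cond(_n == 2, ., _n)
gen Car2006 = .
gen Car2008 = 10 * _n

* Stop if a lowercase variant such as car2008 exists
capture unab lower : car????
if _rc == 0 {
    display as error "unexpected lowercase variables: `lower'"
    exit 198
}

* Create the indicators and collect the years that were found
local years
foreach v of varlist Car???? {
    gen byte Had`v' = !missing(`v')
    local years `years' `=substr("`v'", 4, 4)'
}
display "`years'"
```

The local years then holds whichever years exist in the dataset in memory, so later loops can use foreach y of local years rather than a fixed range.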

How to take maximum of rolling window without SSC packages

How can I create a variable in Stata that contains the maximum of a dynamic rolling window applied to another variable? The rolling window must be able to change iteratively within a loop.
max(numvar, L1.numvar, L2.numvar) will give me what I want for a single window size, but how can I change the window size iteratively within the loop?
My current code for calculating the rolling sum (credit to @Nick Cox for the algorithm):
generate var1lagged = 0
forval k = -2(1)2 {
if `k' < 0 {
local k1 = -(`k')
replace var1lagged = var1lagged + L`k1'.var1
}
else replace var1lagged = var1lagged + F`k'.var1
}
How would I achieve the same sort of flexibility but with the maximum, minimum, or average of the window?
In the simplest case, suppose that the number of lags in the window, at least 1, is held in a local macro K:
local arg L1.numvar
forval k = 2/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
If the window also includes the present value, only the starting point changes:
local arg numvar
forval k = 1/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
More generally, numvar would not be a specific variable name, but would be a local macro including such a name.
EDIT 1
This returns missing as a result only if all arguments are missing. If you wanted to insist on missing as a result if any argument is missing, then go
gen wanted = cond(missing(`arg'), ., max(`arg'))
EDIT 2
Check out rolling more generally. Otherwise, to calculate a rolling mean directly, you need to work out (1) the sum, as in the question, and (2) the number of non-missing values, and then divide the first by the second.
The working context of the OP evidently rules out installing community-contributed commands; otherwise I would recommend rangestat and rangerun (SSC). Note that many community-contributed commands have been published via the Stata Journal, GitHub or user sites.
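The direct calculation of a rolling mean could be sketched as follows, using only official Stata. The toy data and the names t, total, n and wanted_mean are assumptions for illustration. The window here is the present value plus K lags, and missing values are skipped rather than propagated.

```stata
* Toy time series (hypothetical): numvar is missing at t == 4
clear
set obs 10
gen t = _n
gen numvar = cond(_n == 4, ., _n^2)
tsset t

local K 2
* start with the present value, then add each lag in turn
gen double total = cond(missing(numvar), 0, numvar)
gen byte n = !missing(numvar)
forval k = 1/`K' {
    replace total = total + cond(missing(L`k'.numvar), 0, L`k'.numvar)
    replace n = n + !missing(L`k'.numvar)
}
* mean of the non-missing values in the window; missing if there are none
gen double wanted_mean = total / n if n > 0
```

Changing K changes the window length, so the same loop can sit inside an outer loop over window sizes.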

Stata input command not allowing local macros

I found this curious behavior in the input command for Stata.
When you pass a local macro as an argument either for one variable or multiple, the input command gives this error:
'`' cannot be read as a number
Here are two examples that give the same error:
clear
local nums 1 1 1
input a b c
`nums'
end
clear
local num 1
input a b c
1 1 `num'
end
Is there a way to pass macros into the input command?
This is in substance largely a comment on the answer by Aaron Wolf, but the code makes it too awkward to fit in a physical comment.
Given stuff in a local, another way to do it is
clear
local num "1 1 1"
set obs 1
foreach v in a b c {
gettoken this num : num
gen `v' = `this'
}
Naturally, there are many ways to get 1 1 1 into three variables.
This does not pass a macro to the input command per se, but it does achieve your desired result, so perhaps it can help with what you are trying to do.
The general idea is to set the value of a string variable to the contents of a local, then split that variable (similar to the text-to-columns button in Excel).
clear
local nums "1 1 1"
foreach n of local nums {
if "`nums_2'" == "" local nums_2 "`n'"
else local nums_2 = "`nums_2'/`n'"
}
set obs 1
gen a = "`nums_2'"
split a, parse("/") gen(b) destring
drop a

Why is my forvalues loop in stata not working?

I am trying to find outliers in a variable called nbrs, generating an interquartile range (IQR) variable called nbrs_iqr. I then want to loop (for practice with this looping concept) over the values 1.5, 2, 5, and 10, multiplying each by the IQR.
I keep getting a syntax error (invalid syntax r(198);) on the loop. I have seen something about not being able to do the forvalues loop when the values are not a range but have seen examples where it is a non-range without being explicit that that is permitted. I figured the spaces worked to separate the non-range values but I've thrown up my hands from there.
sum nbrs, detail
return list
gen nbrs_iqr = r(p75)-r(p25)
tab nbrs_iqr
forvalues i = 1.5 2 5 10 {
gen nbrs_out_`i'=`i'*nbrs_iqr
}
help forvalues is clear on the syntax you can use. Yours is not a valid range. You can work with foreach, but notice that a . in a variable name is not allowed.
One solution is to use strtoname():
clear
set more off
sysuse auto
keep price
sum price, detail
gen nbrs_iqr = r(p75)-r(p25)
foreach i of numlist 1.5 2 5 10 {
local newi = strtoname("`i'")
gen nbrs_out`newi' = `i' * nbrs_iqr
}
describe
My advice: familiarize yourself with help help.

Getting unknown function mean() in a forvalues loop

Getting unknown function mean for this. Can't use egen because it has to be calculated for each value. A little confused.
edu_mov_avg=.
forvalues current_year = 2/133 {
local current_mean = mean(higra) if longitbirthqtr >= current_year - 2 & longitbirthqtr >= current_year + 2
replace edu_mov_avg = current_mean if longitbirthqtr =
}
Your code is a long way from working. This should be closer.
gen edu_mov_avg = .
qui forvalues current_qtr = 2/133 {
su higra if inrange(longitbirthqtr, `current_qtr' - 2, `current_qtr' + 2), meanonly
replace edu_mov_avg = r(mean) if longitbirthqtr == `current_qtr'
}
You need to use a command, generate, to create a new variable.
You need to reference local macro values with left and right single quotation marks, as in `current_qtr'.
egen has its own mean() function, but it produces a variable, whereas you need a constant here. Using summarize, meanonly is the most efficient method. There is in Stata no mean() function that can be applied anywhere. Once you use summarize, there is no need to use a local macro to hold its results. Here r(mean) can be used directly.
You have >= twice, but presumably don't mean that. Using inrange() is not essential in writing your condition, but gives shorter code.
You can't use if qualifiers to qualify assignment of local macros in the way you did. They make no sense to Stata, as such macros are constants.
longitbirthqtr looks like a quarterly date. Hence I didn't use the name current_year.
With a window this short, there is an alternative using time series operators
tsset longitbirthqtr
gen edu_mov_avg = (L2.higra + L1.higra + higra + F1.higra + F2.higra) / 5
That is not exactly equivalent as missings will be returned for the first two observations and the last two.
Your code may need further work if your data are panel data. But the time series operators approach remains easy so long as you declare the panel identifier, e.g.
tsset panelid longitbirthqtr
after which the generate call is the same as above.
All that said, rolling offers a framework for such calculations.