Stata: combine foreach with by

My data has some missing values for the variable issue. I'm trying to impute the most recent past issue value (for that subject, identified by id1 and id2), if any. If all past issue values are missing, I want the code to leave the current value as missing.
I tried the below code, but Stata says foreach can't be combined with by.
bys id1 id2 (date): foreach v in 1(1)_n {
replace issue[n] = issue[n-v] if !missing(issue[n-v]) and missing(issue[n])==1
}
Is there a way to do this without using foreach with by?

The attempted loop over observations is quite unnecessary, as Stata does that anyway.
If you want to use only the most recent non-missing value it is likely that you want this:
clonevar clone = issue
bys id1 id2 (date): replace issue = clone[_n-1] if missing(issue)
Note the following bugs in your code, apart from the one you flag:
foreach v in 1(1)_n: foreach won't expand a numlist with in; nor will it evaluate _n for you.
replace issue[n]: subscripts are not allowed in that position; replace issue means the same thing anyway.
issue[n-v]: you'd need a local macro reference there, i.e. `v'.
and is not a keyword: you need & if you want a logical "and".
n is presumably a typo for _n.
See also this FAQ on replacing missing values
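A replace that reads from an untouched copy looks back only one observation. If the aim is to carry the last known value forward across a run of missings, as the question describes, the cascading form below is a minimal sketch. The toy data are invented for illustration, and only the variable names id1, id2, date and issue come from the question.

```stata
* toy data (hypothetical values), two subjects identified by id1 and id2
clear
input id1 id2 date issue
1 1 1 10
1 1 2 .
1 1 3 .
1 1 4 20
1 2 1 .
1 2 2 30
1 2 3 .
end

* replace works through observations in order within each group, so a value
* filled in at one observation is available to the next one
bysort id1 id2 (date): replace issue = issue[_n-1] if missing(issue)

list, sepby(id1 id2)
```

The first observation of the second subject stays missing because there is no earlier value to copy, which matches the requirement to leave such cases alone.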

Related

Is there a way to extract year range from wide data?

I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.
Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
noting the fact in Stata that true or false expressions evaluate to 1 or 0 directly.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).
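On the other part of the question, getting the range of years into a local: one possible sketch is to expand the wildcard with unab and strip the year from each name. The toy dataset and the local names carvars, years and y are assumptions made for illustration, and the code relies on every name being Car followed by exactly four digits.

```stata
* toy data standing in for one of the wide datasets (hypothetical values)
clear
set obs 3
generate Car2005 = 1
generate Car2006 = .
generate Car2008 = 3

* unab expands the wildcard and stops with an error if nothing matches
unab carvars : Car????

local years
foreach v of local carvars {
    * characters 4 to 7 of the name hold the year
    local y = substr("`v'", 4, 4)
    local years `years' `y'
    generate byte Had`v' = !missing(`v')
}

display "`years'"
```

Because unab fails when no variable matches Car????, a dataset that uses car???? instead would stop the do-file rather than pass silently.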

Check that data are constant within group

I often find myself needing to check whether or not variables are constant within a group. This is how I currently go about this (assume that the group is defined by a-b-c and the variable in question is var):
bys a b c (var): gen isconstant=var[1]==var[_N]
*manually inspect the results of the below tabulation; if all 1's, then it is constant
tab isconstant
drop isconstant
(Note that the above approach assumes that there are no missing observations within a group. I would have to think more about how to approach it if there were missings. And instead of manually checking, could use something along the lines of assert.)
This works fine, but is there a more succinct way to do this? Perhaps a one line solution, roughly analogous to isid ..., but of course checking for something else.
The principle behind your approach is also explained in this FAQ but I am not aware of a dedicated command. Still, it is programmable and you are a programmer, so where is yours?
Here is a quick stab:
*! 1.0.0 NJC 2 March 2020
program homog, sortpreserve
version 8
syntax varname [if] [in] [, MISSing BY(varlist) ]
* missings are ignored by default
if "`missing'" == "" {
marksample touse, strok
if "`by'" != "" markout `touse' `by', strok
}
else marksample touse, novarlist
tempvar OK
bysort `touse' `by' (`varlist') : gen byte `OK' = `varlist'[1] == `varlist'[_N]
quietly summarize `OK' if `touse'
if r(min) == 0 display as err "assertion is false"
end
and some silly examples:
. sysuse auto, clear
(1978 Automobile Data)
. homog mpg
assertion is false
. homog rep78, by(rep78)
. gen one = 1
. homog one
. replace one = . in L
(1 real change made, 1 to missing)
. homog one
. homog one, missing
assertion is false
So, the principles are
No news is good news. The only possible output, other than error messages, is a message "assertion is false". This isn't treated as an error. If your taste runs otherwise, clone the program, rename it and change the way it works.
by() is an option and if specified causes all comparisons to be by the distinct groups of observations so identified.
Missings are ignored by default. The option missing changes that so that for example 42 and missing are reported as different. This applies also to missing values of any by() variables.
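For a one-off check without a dedicated program, the assert route mentioned in the question can be sketched as follows. The data and the variable name x are invented for illustration, and, as in the question, the check assumes there are no missing values within a group.

```stata
* toy data (hypothetical): groups defined by a and b
clear
input a b x
1 1 5
1 1 5
1 2 7
1 2 7
end

* silent if x is constant within every a-b group; stops with an error otherwise
bysort a b (x): assert x[1] == x[_N]

* make one group non-constant and capture the failure instead of stopping
replace x = 8 in 4
capture bysort a b (x): assert x[1] == x[_N]
display _rc
```

A nonzero return code after capture signals that at least one group is not constant.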

Why do I get an invalid syntax error with a foreach loop?

I want to rename variable names starting with intensity. I received an invalid syntax, r(198) error, with the following code.
#delimit;
foreach VAR of varlist intensity* {;
local NEW = subinstr("`VAR'", "intensity", "int");
rename `VAR' `NEW';
};
Your use of the delimiter ; here does not bite, so I will ignore it.
The error is in the use of subinstr(), which must have four arguments, the fourth being the number of substitutions to be made. See help subinstr().
This works (note please the use of a minimal complete verifiable example):
clear
set obs 1
generate intensity1 = 1
generate intensity2 = 2
foreach VAR of varlist intensity* {
local NEW = subinstr("`VAR'", "intensity", "int", 1)
rename `VAR' `NEW'
}
ds
But the loop is utterly unnecessary. First, let's flip the names back and then show how to change names directly:
rename int* intensity*
rename intensity* int*
See help rename group for more.

Extract the mean from svy mean result in Stata

I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat village
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then I try to turn this macro names into a variable:
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.
It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
Some notes on your original post
Your loop doesn't do what you intend (although again, it is not crystal clear to me) because you are supplying a list with one element: the literal text names. You can see this by running and comparing:
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
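To see what that extended macro function returns, here is a small sketch on a made-up matrix. The matrix A and its row names are hypothetical and merely stand in for the transposed e(b).

```stata
* a 2 x 1 matrix with labelled rows (hypothetical stand-in for villagemean)
matrix A = (1 \ 2)
matrix rownames A = first second

* copy the row names of A into the local macro names
local names : rownames A
display "`names'"

* loop over the contents of the macro, not over the literal word names
foreach v of local names {
    display "`v'"
}
```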
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, if you have not read the initial chapters of the Stata User's Guide, then you must do so.

stata - variable operations conditional to existent vars and to a list of varnames

I have this problem.
My dataset has variables like:
sec20_var1 sec22_var1 sec30_var1
sec20_var2 sec22_var2 sec30_var2 sec31_var2
(~102 sectors, ~60 variables; not all of the combinations are complete or even existent)
My intention is to build an indicator that averages variables within sector. So it is an "aggregated sector" that contains sectors belonging to a class in a high-med-low technology fashion. I already have the definitions of which sectors each category should include. Let's say, in high technology I should put sec20 and sec31.
The problem: the list of sectors belonging to a class and the actual sectors available for each variable don't match. So I'm stuck with this problem and started to do it manually. My best approach was:
set more off
foreach v in _var02 {
ds *`v'
di "`r(varlist)'"
local sects`v' `r(varlist)'
foreach s in sec26 sec28 sec37 {
capture confirm local sects`v'
if !_rc {
egen oecd_medhigh_avg_`v'=rowmean(`s'`v' sec28`v' sec37`v' sec40`v' sec59`v' sec92`v' sec54`v' sec55`v' sec48`v' sec50`v' sec53`v' sec4`v' sec5`v' sec6`v')
else {
di "`v' didnt existed"
}
}
}
}
I got it to work only with those variables that have all the sectors present in the rowmean list (which is simpler since I don't have to store the varlist in a macro). I would like to average the AVAILABLE sectors, even if there are only two per variable.
I also noticed that storing the varlist in a macro could be helpful, but I don't know how to put it into my code. I'm totally stuck here.
Thanks for your help! :)
Thank you @SOConnell. As I said in my comment, I went in the same direction, but I'm still searching for the solution I expected (I don't know how to program it, or even whether it's possible).
I used this code, which goes in the same direction as the one by @SOConnell, but I found this one clearer. The trick is the _rc==111 check, which catches the missing sector-by-variable combinations and creates them so that they can be used in the second part. Everything worked. It's not elegant, but it has some practical use. :) The third part drops the empty variables that were created.
*COMPLETING THE LIST OF COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 _... {
foreach s in sec27 sec35 sec42 sec43 sec45 sec46 sec39 sec52 sec67 {
capture confirm variable `s'`v'
if _rc==111 {
gen `s'`v'=.
}
}
}
*GENERATING THE INDICATOR WITH ALL POSSIBLE COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 ... {
egen oecd_high_avg_`v'=rowmean(sec27`v' sec35`v' sec42`v' sec43`v' sec45`v' sec46`v' sec39`v' sec52`v' sec67`v')
}
*DROPPING MISSING VARIABLES CREATED TO DO THE INDICATOR.
set more off
foreach v of varlist * {
gen TEMP=.
replace TEMP=1 if !missing(`v')
egen TEMPSUM=sum(TEMP)
if TEMPSUM==0 {
di " >>> Dropping empty variable: `v'"
drop `v'
}
drop TEMP TEMPSUM
}
Note that I have truncated the list of variables.
I will call what you are referring to as variables as "accounts".
The workaround would be to create empty variables in the dataset for all sectorXaccount combinations. From a point where you already have your dataset loaded into memory:
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
Then apply the rowmean operation to the full definition of each indicator. The missings won't be calculated into your rowmean, so it will effectively be an average of available cells without you having to do the selection manually. You could then probably automate deleting the empty variables you created if you do something like:
g start=.
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
g end=.
[indicator calculations go here]
drop start-end
However, it seems like you would be creating averages that might not be comparable (some will have 2 underlying values, some 3, some 4, etc.) so you need to be careful there (but you are probably already aware of that).
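An alternative that avoids creating and later dropping placeholder variables is to collect only the names that exist and pass that list to rowmean. This is a sketch under assumptions, not the method used above: the toy data, the sector list and the local names wanted and have are all hypothetical.

```stata
* toy data (hypothetical): sec22_var2 is deliberately absent
clear
input sec20_var2 sec31_var2
1 3
. 5
. .
end

* sectors that define the class, whether or not they exist for this variable
local wanted sec20 sec22 sec31

local have
foreach s of local wanted {
    * exact rules out matches by variable-name abbreviation
    capture confirm variable `s'_var2, exact
    if !_rc local have `have' `s'_var2
}

display "`have'"

* rowmean averages the nonmissing values in each row
egen high_avg_var2 = rowmean(`have')
list
```

If none of the wanted variables exists, the macro is empty and egen will fail, so a check on the macro before the egen line may be worth adding.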