Extract the mean from svy mean result in Stata - stata

I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat villagemean
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then I try to turn the macro names into a variable:
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from the Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.

It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
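As a minimal sketch of the two extended macro functions just described, the following uses a made-up list (the macro name levs and its contents are assumptions for illustration only, not part of the example data):

```stata
* hypothetical list of levels, for illustration only
local levs 1 2 5

* number of words in the list
local wc : word count `levs'
display "the list has `wc' elements"

* pick out each element by position
forvalues i = 1/`wc' {
    local w : word `i' of `levs'
    display "element `i' is `w'"
}
```

The inline form `:word `i' of `levs'' used in the answer does the same thing without storing the result in a separate macro first.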
Some notes on your original post
Your loop doesn't do what you intend (although, again, that is not crystal clear to me) because foreach ... in takes a literal list, and you are supplying a list with one element: the text names. You can see this by running and comparing:
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
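A small sketch of what : rownames and : colnames return, reusing the example data from above. The matrix name ebt is an assumption introduced here; the exact names printed depend on your Stata version, so no output is shown:

```stata
webuse nhanes2f, clear
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)

* e(b) is a row vector, so the group labels are its column names
local cnames : colnames eb
display "`cnames'"

* after transposing, the same labels become row names
matrix ebt = eb'
local rnames : rownames ebt
display "`rnames'"
```

In both cases the result is an ordinary local macro holding a space-separated list, which can then be looped over with foreach ... of local.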
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, if you have not read the initial chapters of the Stata User's Guide, then you should do so.

Related

How to separate Stata macro `varlist' with commas for using in mi( ) and inlist( )?

I want to store a list of variables in a macro and then call that macro inside a mi() statement. The original application is for a programme that uses data I cannot bring online for secrecy reasons, and which will include the following statement:
generate u = cond(mi(`vars'),., runiform(0,1))
The issue being that mi() requires comma separated variable names but vars is delimited by spaces.
I use the auto dataset and mark to illustrate my problem:
sysuse auto
local myvars foreign price
mark missing if mi(`myvars')
In this example, mi() asks for arguments separated by commas, Stata stops and complains that it cannot find a foreignprice variable. Is there a utility function that will insert the commas between the macro elements?
A direct answer to the question as set is to use the extended macro function subinstr to change spaces to commas:
sysuse auto
local myvars foreign price
local myvars : subinstr local myvars " " ",", all
mark missing if mi(`myvars')
If the aim is to create a marker variable that marks observations with any values missing on specified variables, then there are alternative ways, most of which don't need any fiddling with separators in a list. This doesn't purport to be a complete set.
A1.
regress foreign price
gen missing = !e(sample)
A2.
egen missing = rowmiss(foreign price)
replace missing = missing > 0
A3.
local myvars foreign price
local myvars : subinstr local myvars " " ",", all
gen missing = missing(`myvars')
A4.
gen missing = 0
quietly foreach v in foreign price {
replace missing = 1 if missing(`v')
}
A5.
mark missing
markout missing foreign price
replace missing = !missing
EDIT In the edited question there is reference to this within a program:
generate u = cond(mi(`vars'),., runiform(0,1))
I wouldn't do that, even with the macro edited to include commas too, although any issue is more one of personal taste.
marksample touse
markout `touse' `vars'
generate u = runiform(0,1) if `touse'
It's likely that the indicator variable so produced is needed, or at least useful, somewhere else in the same program.
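For context, here is a hedged sketch of how those three lines might sit inside a program. The program name genu and the option vars() are hypothetical, chosen only for illustration; the original program is not shown in the question:

```stata
capture program drop genu
program define genu
    * vars() is a hypothetical option holding the variables to check
    syntax [if] [in], VARs(varlist)
    * touse = 1 for observations satisfying if/in
    marksample touse
    * set touse = 0 where any variable in vars() is missing
    markout `touse' `vars'
    generate u = runiform() if `touse'
end

sysuse auto, clear
genu, vars(rep78 price)
count if missing(u)
```

Note that markout takes the marker variable as its first argument, followed by the variables to check. String variables would additionally need the strok option.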

Generate variables with loop over pairs of variables

I have data on quantities and Values for a set of countries, and currently the variable names are Q_US V_US Q_UK V_UK Q_France V_France and in that order: Quantity_country Value_country, etc.
For each country (US, UK, France, etc.) I want to generate a new variable that gives me the unit value. Manually I would create them as
gen unit_US = V_US/Q_US
gen unit_UK = V_UK/Q_UK
gen unit_France = V_France/Q_France
But I have 100+ countries, and it would be great to do this in a loop if possible.
Is there an easy way to do this?
Let's get a list of all the countries as you have used them in variable names.
unab where : V_*
local where " `where'"
local where : subinstr local where " V_" " ", all
The additional space is designed to ensure that the text removed is just the prefix V_ at the start of variable names. For another example of using unab, see this FAQ.
Check it worked:
display "`where'"
Now loop:
foreach c of local where {
gen unit_`c' = V_`c'/Q_`c'
}
I'd also consider reshape long.
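A minimal sketch of the reshape long alternative, on made-up data. It assumes an identifier variable (called id here, which is not in the question) that uniquely identifies each observation:

```stata
clear
* toy data; id is an assumed identifier variable
input id Q_US V_US Q_UK V_UK
1 10 50 4 20
2  5 30 8 16
end

* one row per id-country pair; the country suffix is a string
reshape long Q_ V_, i(id) j(country) string
gen unit = V_/Q_
list
```

In long form a single generate statement covers every country, at the cost of changing the layout of the dataset.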

Stata: compare two datasets and drop different variables

I have two large datasets (more than 1000 variables in each), one of which has all the variables of the second, plus additional variables. I would like to get a list of all these additional variables, and then drop them and append one dataset to another. I have tried the command dta_equal, but got the same problem found here: http://www.stata.com/statalist/archive/2011-08/msg00308.html
I guess append, keep() cannot do what I want directly, i.e., append the datasets while dropping the additional variables, since I would have to type the variables into the keep() option one by one, which is not realistic given my large dataset.
Are there any ways to deal with this?
There are several Stata commands that can be useful here.
The unab command is used in the first and third examples to make a list of the variables in the dataset with fewer variables. The second example and the final part use the describe command to obtain the list of variables in a dataset not currently in memory.
The final part of the example shows how to use extended macro list functions to obtain a list of common variables and the set of variables not common to both datasets.
* simulate 2 datasets, one has more variables than the other
sysuse auto, clear
save "data1.dta", replace
gen x = _n
gen y = -_n
save "data2.dta", replace
* example 1: drop after append
use "data1.dta", clear
unab vcommon : *
gen source = 1
append using "data2.dta"
replace source = 2 if mi(source)
keep `vcommon' source
* example 2: drop first then append
clear
describe using "data1.dta", varlist short
local vcommon `r(varlist)'
use `vcommon' using "data2.dta", clear
gen source = 2
append using "data1.dta"
replace source = 1 if mi(source)
* example 3: append and keep on the fly
use "data1.dta", clear
unab vcommon : *
gen source = 1
append using "data2.dta", keep(`vcommon')
replace source = 2 if mi(source)
* use extended macro list functions to manipulate variable list
clear
describe using "data1.dta", varlist short
local vlist1 `r(varlist)'
describe using "data2.dta", varlist short
local vlist2 `r(varlist)'
local vcommon : list vlist1 & vlist2
local vinonly1 : list vlist1 - vlist2
local vinonly2 : list vlist2 - vlist1
dis "common variables = `vcommon'"
dis "variables in data1 not found in data2 = `vinonly1'"
dis "variables in data2 not found in data1 = `vinonly2'"
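As a sketch of how the difference list could then be used to do the drop and append that the question asks for, the following reuses the simulated files and macro names from above. The guard against an empty list is an addition of this sketch:

```stata
* recreate the two simulated datasets
sysuse auto, clear
save "data1.dta", replace
gen x = _n
gen y = -_n
save "data2.dta", replace

* variables found only in data2
describe using "data1.dta", varlist short
local vlist1 `r(varlist)'
describe using "data2.dta", varlist short
local vlist2 `r(varlist)'
local vinonly2 : list vlist2 - vlist1

* drop the extra variables, then append
use "data2.dta", clear
if "`vinonly2'" != "" drop `vinonly2'
append using "data1.dta"
```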

Handling string variables inside collapse command

Edit: I should have generated better data. It isn't necessarily the case that the string variable is destringable. I'm just being lazy here (I don't know how to generate random letters).
I have a data set with a lot of strings that I want to collapse, but it seems that in general collapse doesn't play nicely with strings, particularly (firstnm) and (count). Here are some similar data.
clear
set obs 9
generate mark = .
replace mark = 1 in 1
replace mark = 2 in 6
generate name = ""
generate random = ""
local i = 0
foreach first in Tom Dick Harry {
foreach last in Smith Jones Jackson {
local ++i
replace name = "`first' `last'" in `i'
replace random = string(runiform()) in `i'
}
}
I want to collapse on "mark", which is simple enough with replace and subscripts.
replace mark = mark[_n - 1] if missing(mark)
But my collapses fail with type mismatch errors.
collapse (firstnm) name (count) random, by(mark)
If I use (first), then the first error clears, but (count) still fails. Is there a solution that avoids an additional by operation?
It seems that the following works, but would also be a lot more time-consuming for my data.
generate nonmissing_random = !missing(random)
egen nonmissing_random_count = count(nonmissing_random), by(mark)
collapse (first) name nonmissing_random_count, by(mark)
Or is any solution that facilitates using collapse the same?
You can use destring random,replace and then the following works:
collapse (first) name (count) random, by(mark)
mark  name          random
1     Tom Smith     5
2     Dick Jackson  4
But collapse (firstnm) name (count) random, by(mark) still generates a type mismatch error.
Thinking on this some more, my egen count with by operation isn't necessary. I can generate a 1/0 variable for nonmissing/missing string variables then use (sum) in collapse.
generate nonmissing_random = !missing(random)
collapse (first) name (sum) nonmissing_random, by(mark)

stata - variable operations conditional to existent vars and to a list of varnames

I have this problem.
My dataset has variables like:
sec20_var1 sec22_var1 sec30_var1
sec20_var2 sec22_var2 sec30_var2 sec31_var2
(~102 sectors, ~60 variables; not all of the combinations are complete or even exist)
My intention is to build an indicator that averages each variable across sectors. So it is an "aggregated sector" that contains the sectors belonging to a class, in a high/medium/low technology fashion. I already have the definitions of which sectors belong in each category. Let's say that in high technology I should put sec20 and sec31.
The problem: the list of sectors belonging to a class and the actual sectors available for each variable don't match. So I'm stuck with this problem and started to do it manually. My best approach was:
set more off
foreach v in _var02 {
ds *`v'
di "`r(varlist)'"
local sects`v' `r(varlist)'
foreach s in sec26 sec28 sec37 {
capture confirm local sects`v'
if !_rc {
egen oecd_medhigh_avg_`v'=rowmean(`s'`v' sec28`v' sec37`v' sec40`v' sec59`v' sec92`v' sec54`v' sec55`v' sec48`v' sec50`v' sec53`v' sec4`v' sec5`v' sec6`v')
else {
di "`v' didnt existed"
}
}
}
}
I got it to work only for those variables that have all the sectors present in the rowmean (which is simpler, since I don't have to store the varlist in a macro). I would like to take the average of the AVAILABLE sectors, even if there are only two per variable.
I also noticed that storing the variable list in a macro could be helpful, but I don't know how to put it into my code. I'm totally stuck here.
Thanks for your help! :)
Thank you @SOConnell. As I said in my comment, I went in the same direction, but I'm still searching for the solution I expected (I don't know how to program it, or even whether it's possible).
I used this code, which goes in the same direction as the one by @SOConnell, but I found this one clearer. The trick is the _rc==111 check, which catches the missing sector-by-variable combinations and creates them so that they can be used in the second part. Everything worked. It's not elegant, but it has some practical use. :) The third part drops the empty variables that were created.
*COMPLETING THE LIST OF COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 _... {
foreach s in sec27 sec35 sec42 sec43 sec45 sec46 sec39 sec52 sec67 {
capture confirm variable `s'`v'
if _rc==111 {
gen `s'`v'=.
}
}
}
*GENERATING THE INDICATOR WITH ALL POSSIBLE COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 ... {
egen oecd_high_avg_`v'=rowmean(sec27`v' sec35`v' sec42`v' sec43`v' sec45`v' sec46`v' sec39`v' sec52`v' sec67`v')
}
*DROPPING MISSING VARIABLES CREATED TO DO THE INDICATOR.
set more off
foreach v of varlist * {
gen TEMP=.
replace TEMP=1 if !missing(`v')
egen TEMPSUM=sum(TEMP)
if TEMPSUM==0 {
di " >>> Dropping empty variable: `v'"
drop `v'
}
drop TEMP TEMPSUM
}
Note that I have shortened the list of variables.
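An alternative that avoids creating and later dropping placeholder variables is to collect only the variables that exist into a local macro and pass that to rowmean. This is a sketch on made-up data, not the approach used above; the macro name present and the toy variables are assumptions:

```stata
clear
* toy data: sec35_var03 and all sec42 variables are absent
input sec27_var02 sec35_var02 sec27_var03
1 3 5
2 . 6
end

foreach v in _var02 _var03 {
    * start with an empty list for this variable suffix
    local present
    foreach s in sec27 sec35 sec42 {
        * exact rules out matches by abbreviation
        capture confirm variable `s'`v', exact
        if !_rc local present `present' `s'`v'
    }
    * only build the indicator if at least one sector exists
    if "`present'" != "" {
        egen oecd_high_avg`v' = rowmean(`present')
    }
}
list
```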
I will refer to what you call variables as "accounts".
The workaround would be to create empty variables in the dataset for all sectorXaccount combinations. From a point where you already have your dataset loaded into memory:
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
Then apply the rowmean operation to the full definition of each indicator. The missings won't be calculated into your rowmean, so it will effectively be an average of available cells without you having to do the selection manually. You could then probably automate deleting the empty variables you created if you do something like:
g start=.
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
g end=.
[indicator calculations go here]
drop start-end
However, it seems like you would be creating averages that might not be comparable (some will have 2 underlying values, some 3, some 4, etc.) so you need to be careful there (but you are probably already aware of that).
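One way to keep track of that is to store, next to each average, the number of non-missing values behind it. A minimal sketch on made-up data, with hypothetical variable names:

```stata
clear
* toy data with some missing cells
input sec20_var2 sec31_var2
1 .
2 4
. .
end

* average of the available sectors
egen high_avg_var2 = rowmean(sec20_var2 sec31_var2)
* how many sectors contributed to each average
egen high_n_var2 = rownonmiss(sec20_var2 sec31_var2)
list
```

The count variable can then be used to restrict comparisons to averages built from the same number of sectors, or at least to flag those that are not.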