I'm using the levelsof command to identify unique values of a variable and stick them into a macro. Then later on I'd like to use those values in the macro to select records from another dataset that I'll load.
What I have in mind is something along the following lines:
keep if inlist(variable, "`macrovariable'")
Does that work? And is there another more efficient option? I could do this easily in R (because vectors are easier to work with than macros), but this project requires Stata.
Clarification:
if I have a variable with three unique values, a, b and c, I want to store those in a macro variable so I can later take another dataset and select observations that match one of those values.
Normally I can use the inlist function to do this manually, but I'd like to soft-code it so I can run the program with different sets of values. And I can't get the inlist function to work with macros.
* the source data
levelsof x, local( allx )
* make it -inlist-friendly
local allxcommas : subinstr local allx " " ", ", all
* bring in the new data
use if inlist(x, `allxcommas') using blah.dta
I suspect your difficulty in using a macro generated by levelsof with inlist is that you forgot the separate(,) option. Note that keep if inlist(...) works directly with such a macro; the example below nevertheless takes the extra step of defining an indicator variable, which makes the selection easy to inspect before anything is dropped.
In the example below I used the 1978 auto data and created a variable make_abb of vehicle manufacturers (or make) which took only a handful of distinct values ("Do" for Dodge, etc.).
I then used the levelsof command to generate a local macro of the manufacturers which had a vehicle model with a poor repair record (the variable rep78 is a categorical repair record variable where 1 is poor and 5 is good). The option separate(,) is what adds the commas into the macro and enables inlist to read it later on.
Finally, if I want to drop the manufacturers which did not have a poor repair record, I generate a dummy variable named "keep_me" and fill it in using the inlist function.
*load some data
sysuse auto
*create some make categories by splitting the make and model string
gen make_abb=substr(make,1,2)
lab var make_abb "make abbreviation (string)"
*use levelsof with "local(macro_name)" and "separate(,)" options
levelsof make_abb if rep78<=2, separate(,) local(make_poor)
*generate a dummy using inlist and your levelsof macro from above
gen keep_me=1 if inlist(make_abb,`make_poor')
lab var keep_me "dummy of makes that had a bad repair record"
*now you can discard the rest of your data
keep if keep_me==1
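A minimal sketch of the two-dataset workflow from the question, assuming the auto data stands in for both the source and the target dataset. The tempfile name target and the macro name vals are made up for illustration. It also assumes the values contain no embedded commas and that there are few enough of them for inlist(), which limits the number of string arguments (see help inlist).

```stata
* target dataset, saved to a temporary file (stand-in for the second dataset)
sysuse auto, clear
generate make_abb = substr(make, 1, 2)
tempfile target
save `target'

* source dataset: collect the values of interest, separated by commas
levelsof make_abb if rep78 <= 2, separate(,) local(vals)

* load only the matching observations from the target dataset
use if inlist(make_abb, `vals') using `target', clear
tabulate make_abb
```

Filtering in the use command avoids loading observations that would be dropped straight away, which may matter when the target dataset is large.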
This seems to work for me.
* "using" data
clear
tempfile so
set obs 10
foreach v in list a b c d {
generate `v' = runiform()
}
save `so'
* "master" data
clear
set obs 10
foreach v in list e f g h {
generate `v' = runiform()
}
* merge
local tokeepusing a b
merge 1:1 _n using `so', keepusing(`tokeepusing')
Yields:
. list
+------------------------------------------------------------------------------------------+
| list e f g h a b _merge |
|------------------------------------------------------------------------------------------|
1. | .7767971 .5910658 .6107377 .7256517 .357592 .8953723 .0871481 matched (3) |
2. | .643114 .6305301 .6441092 .7770287 .5247816 .4854506 .3840067 matched (3) |
3. | .3833295 .175099 .4530386 .5267127 .628081 .2273252 .0460549 matched (3) |
4. | .0057233 .1090542 .1437526 .3133509 .604553 .9375801 .8091199 matched (3) |
5. | .8772233 .6420991 .5403687 .1591801 .5742173 .8948932 .4121684 matched (3) |
|------------------------------------------------------------------------------------------|
6. | .6526399 .5137199 .933116 .5415702 .4313532 .8602547 .5049801 matched (3) |
7. | .2033027 .8745837 .8609 .0087578 .9844069 .1909852 .3695011 matched (3) |
8. | .6363281 .0064866 .6632325 .307236 .9544498 .6267227 .2908498 matched (3) |
9. | .366027 .4896181 .0955155 .4972361 .9161932 .7391482 .414847 matched (3) |
10. | .8637221 .8478178 .5457179 .8971257 .9640535 .541567 .1966634 matched (3) |
+------------------------------------------------------------------------------------------+
Does this answer your question? If not, please comment.
I want to store a list of variables in a macro and then call that macro inside a mi() statement. The original application is for a programme that uses data I cannot bring online for secrecy reasons, and which will include the following statement:
generate u = cond(mi(`vars'),., runiform(0,1))
The issue being that mi() requires comma separated variable names but vars is delimited by spaces.
I use the auto dataset and mark to illustrate my problem:
sysuse auto
local myvars foreign price
mark missing if mi(`myvars')
In this example, mi() expects arguments separated by commas, so Stata stops and complains that it cannot find a foreignprice variable. Is there a utility function that will insert the commas between the macro elements?
A direct answer to the question as set is to use the macro extended function subinstr to change spaces to commas:
sysuse auto
local myvars foreign price
local myvars : subinstr local myvars " " ",", all
mark missing if mi(`myvars')
If the aim is to create a marker variable that flags observations with any missing values on the specified variables, then there are alternatives, most of which don't need any fiddling with separators in a list. This doesn't purport to be a complete set.
A1.
regress foreign price
gen missing = !e(sample)
A2.
egen missing = rowmiss(foreign price)
replace missing = missing > 0
A3.
local myvars foreign price
local myvars : subinstr local myvars " " ",", all
gen missing = missing(`myvars')
A4.
gen missing = 0
quietly foreach v in foreign price {
replace missing = 1 if missing(`v')
}
A5.
mark missing
markout missing foreign price
replace missing = !missing
EDIT In the edited question there is reference to this within a program:
generate u = cond(mi(`vars'),., runiform(0,1))
I wouldn't do that, even with the macro edited to include commas, although this is more a matter of personal taste. Within a program I would use:
marksample touse
markout `touse' `vars'
generate u = runiform(0,1) if `touse'
It's likely that the indicator variable so produced is needed, or at least useful, somewhere else in the same program.
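As a rough sketch of how this might look inside a complete program: the program name addunif and its generate() option are hypothetical, chosen only for illustration. Because the variables are passed through syntax varlist here, marksample already excludes observations with missing values on them, so no separate markout call is needed in this variant. runiform() is used without arguments, which also draws from the interval (0,1).

```stata
capture program drop addunif
program define addunif
    * hypothetical wrapper: uniform draws only for complete observations
    syntax varlist(numeric) [if] [in], GENerate(name)
    * touse is 0 where if/in exclude the observation
    * or where any variable in varlist is missing
    marksample touse
    generate double `generate' = runiform() if `touse'
end

sysuse auto, clear
addunif rep78 price, generate(u)
count if missing(u)
```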
I have two large datasets (more than 1000 variables in each), one of which has all the variables of the second, plus additional variables. I would like to get a list of all these additional variables, and then drop them and append one dataset to another. I have tried the command dta_equal, but got the same problem found here: http://www.stata.com/statalist/archive/2011-08/msg00308.html
I guess append, keep() cannot do what I want directly, i.e., append one dataset while dropping the additional variables, since I would have to type the variables one by one in the keep() option, which is not realistic given my large dataset.
Are there any ways to deal with this?
There are several Stata commands that can be useful here.
The unab command is used in the first and third examples to make a list of the variables in the dataset with fewer variables. The second example and the final part use the describe command to obtain the list of variables in a dataset not currently in memory.
The final part of the example shows how to use extended macro list functions to obtain the list of common variables and the sets of variables not common to both datasets.
* simulate 2 datasets, one has more variables than the other
sysuse auto, clear
save "data1.dta", replace
gen x = _n
gen y = -_n
save "data2.dta", replace
* example 1: drop after append
use "data1.dta", clear
unab vcommon : *
gen source = 1
append using "data2.dta"
replace source = 2 if mi(source)
keep `vcommon' source
* example 2: drop first then append
clear
describe using "data1.dta", varlist short
local vcommon `r(varlist)'
use `vcommon' using "data2.dta", clear
gen source = 2
append using "data1.dta"
replace source = 1 if mi(source)
* example 3: append and keep on the fly
use "data1.dta", clear
unab vcommon : *
gen source = 1
append using "data2.dta", keep(`vcommon')
replace source = 2 if mi(source)
* use extended macro list functions to manipulate variable list
clear
describe using "data1.dta", varlist short
local vlist1 `r(varlist)'
describe using "data2.dta", varlist short
local vlist2 `r(varlist)'
local vcommon : list vlist1 & vlist2
local vinonly1 : list vlist1 - vlist2
local vinonly2 : list vlist2 - vlist1
dis "common variables = `vcommon'"
dis "variables in data1 not found in data2 = `vinonly1'"
dis "variables in data2 not found in data1 = `vinonly2'"
I have a set of variables the list of which I have saved in a global macro so that I can use them in a function
global inlist_cond "amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss"
The reason why they are saved in a macro is because the list will be in a loop and its content will change depending on the year.
What I need to do is to generate a dummy variable so that water_dummy == 1 if any of the variables in the macro list has the WATER classification. In Stata, I need to write
gen water_dummy = inlist("WATER", "$inlist_cond")
which, ideally, should translate to
gen water_dummy = inlist("WATER", amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss)
But this did not work: the code executed without any errors, but the dummy variable contained only 0s. I know that it is possible to invoke macros inside functions in Stata, but I have never tried it when the macro contains a whole list of conditions. Any thoughts?
The double quotes in the generate statement turn the macro contents into a single literal string, so you are comparing text with text and the data are not consulted at all.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen a = "water"
. gen b = "wine"
. gen c = "beer"
. global myvars "a,b,c"
. gen found1 = inlist("water", "$myvars")
. gen found2 = inlist("water", $myvars)
. list
+---------------------------------------+
| a b c found1 found2 |
|---------------------------------------|
1. | water wine beer 0 1 |
+---------------------------------------+
The first comparison is equivalent to
. di inlist("water", "a,b,c")
0
which finds no match, as "water" is not matched by the (single!) other argument.
Macro references are certainly allowed within function or command calls: as each macro name is replaced by its contents before the syntax is checked, the function or command never even knows that a macro reference was ever used.
As @Aspen Chen concisely points out, omitting the double quotes gives what you want so long as the inlist() syntax remains legal.
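Since the question mentions that the list changes by year inside a loop, here is a minimal sketch of building the comma-separated list on the fly. The three-observation dataset and the macro names first, last and cond are invented for illustration; the variable names follow the pattern in the question.

```stata
clear
input str5(amz2009ras_clss amz2010ras_clss amz2011ras_clss)
"WATER" "LAND"  "LAND"
"LAND"  "LAND"  "WATER"
"LAND"  "LAND"  "LAND"
end

* build ", amz2009ras_clss, amz2010ras_clss, amz2011ras_clss" for a year range
local first 2009
local last  2011
local cond ""
forvalues y = `first'/`last' {
    local cond "`cond', amz`y'ras_clss"
}

* the macro starts with a comma, so it follows "WATER" directly;
* no double quotes around the macro reference
generate water_dummy = inlist("WATER"`cond')
list
```

Keep in mind that inlist() limits the number of string arguments (see help inlist), so a long range of years may need to be split across several calls or handled with a loop.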
If your data structure is something like in the following example, you can try the egen function incss, from egenmore (ssc install egenmore):
clear
set more off
input ///
str15(amz2009 amz2010)
"water" "juice"
"milk" "water"
"lemonade" "wine"
"water & beer" "tea"
end
list
egen watindic = incss(amz*), sub(water)
list
Be aware it searches for substrings (see the result for the last example observation).
A loop that tests for exact matches instead, and so gives a different result for the last observation, is:
gen watindic2 = 0
forvalues i = 2009/2010 {
replace watindic2 = 1 if amz`i' == "water"
}
list
Another solution involves reshape, but I'll leave it at that.
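For completeness, one way the reshape route might look, as a sketch only: the identifier id and the indicator name watindic3 are assumptions introduced here, and the test is for exact matches, as in the loop.

```stata
clear
input str15(amz2009 amz2010)
"water"        "juice"
"milk"         "water"
"lemonade"     "wine"
"water & beer" "tea"
end

* reshape needs an identifier for each original observation
generate id = _n
reshape long amz, i(id) j(year)

* 1 if any year of the same id is exactly "water", 0 otherwise
egen watindic3 = max(amz == "water"), by(id)

* back to the original wide layout
reshape wide amz, i(id) j(year)
list
```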
I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat villagemean
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then I try to turn the macro names into a variable:
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from the Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.
It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
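A small standalone illustration of these two extended macro functions, using a made-up macro called fruits:

```stata
local fruits apple banana cherry

* number of words in the macro
local n : word count `fruits'
display "number of words: `n'"

* pick out each word in turn
forvalues i = 1/`n' {
    local w : word `i' of `fruits'
    display "word `i' is `w'"
}

* the inline (abbreviated) form does the same without an intermediate macro
display "`: word 2 of `fruits''"
```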
Some notes on your original post
Your loop doesn't do what you intend (although again, that is not crystal clear to me) because you are supplying a list with a single element: the literal text names, not the contents of the local macro names. You can see the difference by running and comparing:
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
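To make the rownames function concrete, here is a short sketch with a made-up 3 x 1 matrix A and invented row names. It also shows the foreach ... of local form looping over the stored names.

```stata
* a toy column vector with named rows
matrix A = (1 \ 2 \ 3)
matrix rownames A = north south east

* copy the row names into a local macro
local names : rownames A
display "`names'"

* loop over the contents of the macro, not the literal word "names"
local i = 0
foreach r of local names {
    local ++i
    display "row `i' is `r', value " A[`i', 1]
}
```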
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, if you have not read the initial chapters of the Stata User's Guide, then you must do so.
I need to generate all possible tuples of the integers 1, 2, 3, 4 (with exactly 2 items in each tuple). Then I need to generate a set of variables corresponding to the resulting six tuples. Each variable name should contain a reference to a tuple, and the value of each variable should be a string version of the tuple itself, as illustrated below:
+--------+--------+--------+--------+--------+--------+
| var_12 | var_13 | var_14 | var_23 | var_24 | var_34 |
+--------+--------+--------+--------+--------+--------+
| 12 | 13 | 14 | 23 | 24 | 34 |
+--------+--------+--------+--------+--------+--------+
The tuples are generated with the user-written command tuples (for details, see http://ideas.repec.org/c/boc/bocode/s456797.html), but I am stumbling over generating new variables and assigning values to them in a loop. The code below results in a syntax error, which presumably stems from using the local tuple macros incorrectly, and I would greatly appreciate it if someone could help me solve it.
tuples 1 2 3 4, display min(2) max(2)
forval i = 1/`ntuples' {
gen v`i'=`tuple`i''
rename v`i' var_`tuple`i''
}
tuples is a user-written command from SSC. Over at www.statalist.org you would be expected to explain where it comes from, and that's a very good idea here too.
In your case, you want integers such as 12 to represent a tuple such as "1 2", but the latter looks malformed to Stata when you are creating a numeric variable. Stata certainly won't elide the space(s) even if all other characters presented are numeric, so you need to do that explicitly. At the same time, giving a variable one name and then promptly renaming it can be compressed into a single step.
forval i = 1/`ntuples' {
local I : subinstr local tuple`i' " " "", all
gen var_`I' = `I'
}
Creating a string variable for the tuple with space included would make part of that unnecessary, but the space is still not allowed in the variable name:
forval i = 1/`ntuples' {
local I : subinstr local tuple`i' " " "_", all
gen var_`I' = "`tuple`i''"
}
If this is the whole of your problem, it would have been quicker to write out 6 generate statements! If this is a toy problem representative of something larger, watch out that say "1 23" and "12 3" would both be mapped to "123", so eliding the spaces is unambiguous only with single digit integers; hence the appeal of holding strings as such.
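The ambiguity is easy to demonstrate. This sketch uses two hand-written tuples rather than output from the tuples command, and the macro names squeezed and joined are invented for illustration:

```stata
* two different tuples that collide once the spaces are removed
foreach t in "1 23" "12 3" {
    local squeezed : subinstr local t " " "", all
    local joined   : subinstr local t " " "_", all
    display "`t' -> `squeezed' (spaces removed), `joined' (underscore)"
}
```

Both tuples map to 123 when the space is removed, whereas replacing the space with an underscore keeps them distinct (1_23 and 12_3).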
I am still curious how holding the same tuple in every observation of a variable is a good idea; perhaps your larger purpose would be better met by using string scalars or the local macros themselves.