I often find myself needing to check whether or not variables are constant within a group. This is how I currently go about this (assume that the group is defined by a-b-c and the variable in question is var):
bys a b c (var): gen isconstant=var[1]==var[_N]
*manually inspect the results of the below tabulation; if all 1's, then it is constant
tab isconstant
drop isconstant
(Note that the above approach assumes that there are no missing observations within a group. I would have to think more about how to approach it if there were missings. And instead of manually checking, could use something along the lines of assert.)
This works fine, but is there a more succinct way to do this? Perhaps a one line solution, roughly analogous to isid ..., but of course checking for something else.
The principle behind your approach is also explained in this FAQ but I am not aware of a dedicated command. Still, it is programmable and you are a programmer, so where is yours?
Here is a quick stab:
*! 1.0.0 NJC 2 March 2020
program homog, sortpreserve
version 8
syntax varname [if] [in] [, MISSing BY(varlist) ]
* missings are ignored by default
if "`missing'" == "" {
marksample touse, strok
if "`by'" != "" markout `touse' `by', strok
}
else marksample touse, novarlist
tempvar OK
bysort `touse' `by' (`varlist') : gen byte `OK' = `varlist'[1] == `varlist'[_N]
quietly summarize `OK' if `touse'
if r(min) == 0 display as err "assertion is false"
end
and some silly examples:
. sysuse auto, clear
(1978 Automobile Data)
. homog mpg
assertion is false
. homog rep78, by(rep78)
. gen one = 1
. homog one
. replace one = . in L
(1 real change made, 1 to missing)
. homog one
. homog one, missing
assertion is false
So, the principles are
No news is good news. The only possible output, other than error messages, is a message "assertion is false". This isn't treated as an error. If your taste runs otherwise, clone the program, rename it and change the way it works.
by() is an option and if specified causes all comparisons to be by the distinct groups of observations so identified.
Missings are ignored by default. The option missing changes that so that for example 42 and missing are reported as different. This applies also to missing values of any by() variables.
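The question's aside about assert can also be sketched directly, without a helper program. This is only a minimal illustration on made-up data: the group variable g and the values are assumptions, not taken from the original post.

```stata
* Sketch with made-up data: g is a hypothetical group variable
clear
input g var
1 5
1 5
2 7
2 7
end
* assert is byable: it stops with an error if any group fails the check
bysort g (var): assert var[1] == var[_N]
```

Unlike homog, this treats failure as an error and does not ignore missings: numeric missings sort last, so a group mixing missing and non-missing values fails the check.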
In a Stata program I'm creating, I need to know whether a program parameter is a factor variable or not.
program define my_program, rclass
syntax varname(fv)
if ... {
display "`varlist' is a factor variable"
} else {
display "`varlist' is NOT a factor variable"
}
...
end
my_program age
my_program i.gender
How could I write the if condition to make this work? I would prefer to get this working without checking if varname begins with "i.". Stata knows whether it's a factor variable or not since Stata offers the "fv" option (ie. varname(fv)). So how can I tap into the functionality built into Stata to determine this?
Thanks!
I am embarrassed by the code shown below, but it does point a direction to a solution for you, by comparing the results of unab and fvunab applied to your variable list.
. sysuse auto, clear
(1978 Automobile Data)
. capture unab mac_unab : i.foreign
. display _rc
101
. capture fvunab mac_unab : i.foreign
. display _rc
0
. capture tsunab mac_unab : i.foreign
. display _rc
101
.
I found out that syntax returns a macro s(fvops), "which will be equal to 'true' when factor variables are specified and empty otherwise."
(http://www.stata.com/support/faqs/programming/factor-variable-support/)
Therefore, I'm able to achieve what I wanted with the following code:
program define is_categorical, rclass
syntax varname(fv)
return scalar is_categorical = ("`s(fvops)'" == "true")
end
is_categorical i.education_level
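The returned scalar can then be read from r(). A minimal sketch, assuming the auto dataset shipped with Stata stands in for the original data (foreign and mpg are that dataset's variables, not the poster's):

```stata
capture program drop is_categorical
program define is_categorical, rclass
    syntax varname(fv)
    * s(fvops) is set by syntax when factor-variable operators were used
    return scalar is_categorical = ("`s(fvops)'" == "true")
end

sysuse auto, clear
* called with a factor-variable operator
is_categorical i.foreign
display r(is_categorical)
* called with a plain variable name
is_categorical mpg
display r(is_categorical)
```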
My data has some missing values for the variable issue. I'm trying to impute the most recent past issue value (for that subject, identified by id1 and id2), if any. If all past issue values are missing, I want the code to leave the current value as missing.
I tried the below code, but Stata says foreach can't be combined with by.
bys id1 id2 (date): foreach v in 1(1)_n {
replace issue[n] = issue[n-v] if !missing(issue[n-v]) and missing(issue[n])==1
}
Is there a way to do this without using foreach with by?
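The attempted loop over observations is quite unnecessary, as Stata does that anyway.
If you want to carry forward the most recent non-missing value, it is likely that you want this:
bysort id1 id2 (date): replace issue = issue[_n-1] if missing(issue)
Because replace works through the observations in order, a value filled in for one observation is available to the next, so runs of missings are filled too. If you want to use only the immediately previous original value instead, work from a copy:
clonevar clone = issue
bysort id1 id2 (date): replace issue = clone[_n-1] if missing(issue)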
Note the following bugs in your code apart from that you flag:
foreach v in 1(1)_n: foreach won't expand a numlist with in; nor will it evaluate _n for you.
replace issue[n]: subscripts are not allowed in that position; replace issue means the same thing anyway.
issue[n-v]: you'd need a local reference there.
and is not a keyword: you need & if you want a logical "and"
n presumably is a typo for _n
See also this FAQ on replacing missing values
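To see how carrying forward behaves within groups, here is a small sketch on made-up data (the values are assumptions for illustration only). A leading missing in a group stays missing, because issue[_n-1] for the first observation of a by-group is itself missing.

```stata
* Made-up panel: two subjects identified by id1 and id2
clear
input id1 id2 date issue
1 1 1 3
1 1 2 .
1 1 3 .
1 1 4 8
1 2 1 .
1 2 2 4
1 2 3 .
end
* replace runs in observation order, so filled-in values cascade forward
bysort id1 id2 (date): replace issue = issue[_n-1] if missing(issue)
list, sepby(id1 id2)
```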
I have a set of variables the list of which I have saved in a global macro so that I can use them in a function
global inlist_cond "amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss"
The reason why they are saved in a macro is because the list will be in a loop and its content will change depending on the year.
What I need to do is to generate a dummy variable so that water_dummy == 1 if any of the variables in the macro list has the WATER classification. In Stata, I need to write
gen water_dummy = inlist("WATER", "$inlist_cond")
, which--ideally--should translate to
gen water_dummy = inlist("WATER", amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss)
But this did not work---the code executed without any errors but the dummy variable only contained 0s. I know that it is possible to invoke macros inside functions in Stata, but I have never tried it when the macro contains a whole list of conditions. Any thoughts?
With a literal string specified, which is what the double quotes in the generate statement force, you are comparing text with text, and the data are never consulted.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen a = "water"
. gen b = "wine"
. gen c = "beer"
. global myvars "a,b,c"
. gen found1 = inlist("water", "$myvars")
. gen found2 = inlist("water", $myvars)
. list
+---------------------------------------+
| a b c found1 found2 |
|---------------------------------------|
1. | water wine beer 0 1 |
+---------------------------------------+
The first comparison is equivalent to
. di inlist("water", "a,b,c")
0
which finds no match, as "water" is not matched by the (single!) other argument.
Macro references are certainly allowed within function or command calls: as each macro name is replaced by its contents before the syntax is checked, the function or command never even knows that a macro reference was ever used.
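A tiny sketch of that substitution, using a hypothetical local macro named expr: the same contents are read as an expression or as a literal string depending only on whether double quotes surround the reference.

```stata
local expr "2 + 2"
* After substitution Stata sees: display 2 + 2
display `expr'
* After substitution Stata sees: display "2 + 2"
display "`expr'"
```

The first display evaluates the expression; the second prints the text unchanged, which is the same distinction as between found2 and found1 above.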
As @Aspen Chen concisely points out, omitting the double quotes gives what you want so long as the inlist() syntax remains legal.
If your data structure is something like in the following example, you can try the egen function incss, from egenmore (ssc install egenmore):
clear
set more off
input ///
str15(amz2009 amz2010)
"water" "juice"
"milk" "water"
"lemonade" "wine"
"water & beer" "tea"
end
list
egen watindic = incss(amz*), sub(water)
list
Be aware it searches for substrings (see the result for the last example observation).
A solution with a loop achieving different results is:
gen watindic2 = 0
forvalues i = 2009/2010 {
replace watindic2 = 1 if amz`i' == "water"
}
list
Another solution involves reshape, but I'll leave it at that.
I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat villagemean
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then try to turn this macro names into variable
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.
It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
Some notes on your original post
Your loop doesn't do what you intend (although again, not crystal clear to me) because you are supplying a list (with one element: the text name). You can see it running and comparing:
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
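As a minimal sketch of these extended macro functions in isolation (the matrix A, its row names, and the macro names are made up for illustration):

```stata
* A 2 x 1 matrix with named rows
matrix A = (1 \ 2)
matrix rownames A = first second

* : rownames copies the row names into a local macro as a list of words
local names : rownames A
display "`names'"

* word count and word # of then pick that list apart
local wc : word count `names'
display `wc'
display "`: word 2 of `names''"
```

Looping over the result then needs foreach v of local names (or foreach v in `names'), as compared above.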
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, if you have not read the initial chapters of the Stata User's Guide, then you must do so.
I just started working on a massive dataset with 5 million observations and lots and lots of variables. To process this faster, I want to select only some variables of interest and drop the rest.
With keep, I could select a block of variables, very simple:
keep varx1-varx5
However, the variables I want are not in order in the dataset:
varx1 varx2 varx3 varz1 varz2 vary1 vary2 vary3
Where I don't want the varz variables. I want only the blocks with varx and vary.
So. I'm not very good at loops, but I tried this:
foreach varname of varlist varx1-varx3 vary1-vary3 {
keep `varname'
}
This doesn't work, because it keeps only varx1, then tries to keep the others, and errors out because they have just been dropped.
How can I tell keep to select multiple blocks of variables?
Rather than using keep which will wipe out variables not given to the command, try drop, which will delete only those you specify. The loop is not necessary. An example:
clear
set obs 0
*----- example vars -----
gen varx1 = .
gen varx2 = .
gen varx3 = .
gen varz1 = .
gen varz2 = .
gen vary1 = .
gen vary2 = .
gen vary3 = .
*----- what you want -----
drop varz*
Both commands are documented jointly, so help keep or help drop would have gotten you there.
If you don't know all the variables you want to drop, you can instead keep only the blocks with varx and vary:
keep varx* vary*
In a varlist, * matches zero or more characters, so varx* matches every variable whose name begins with varx.
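keep also accepts several ranges in one varlist, which answers the question as literally asked. A minimal sketch, assuming the variables are stored in the order given in the question:

```stata
clear
set obs 1
* Recreate the question's variable order with placeholder values
foreach v in varx1 varx2 varx3 varz1 varz2 vary1 vary2 vary3 {
    generate `v' = .
}
* Two ranges in a single varlist: no loop needed
keep varx1-varx3 vary1-vary3
describe, simple
```

A range such as varx1-varx3 depends on the storage order of the variables, whereas the wildcard form depends only on their names.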