I just started working on a massive dataset with 5 million observations and lots and lots of variables. To process this faster, I want to select only some variables of interest and drop the rest.
with keep, I could select a block of variables, very simple:
keep varx1-x5
However, the variables I want are not in order in the dataset:
varx1 varx2 varx3 varz1 varz2 vary1 vary2 vary3
Where I don't want the varz variables. I want only the blocks with varx and vary.
So. I'm not very good at loops, but I tried this:
foreach varname of varlist varx1-varx3 vary1-vary3 {
keep `varname'
}
This doesn't work, because it keeps only varx1, then tries to keep the others, and errors out because they have just been dropped.
How can I tell keep to select multiple blocks of variables?
Rather than using keep which will wipe out variables not given to the command, try drop, which will delete only those you specify. The loop is not necessary. An example:
clear
set obs 0
*----- example vars -----
gen varx1 = .
gen varx2 = .
gen varx3 = .
gen varz1 = .
gen varz2 = .
gen vary1 = .
gen vary2 = .
gen vary3 = .
*----- what you want -----
drop varz*
Both commands are documented jointly, so help keep or help drop would have gotten you there.
If you don't know all the variables you want to drop, to keep only the blocks with varx and vary :
keep varx* varz*
The * means “match zero or more” of the preceding expression.
Related
I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.
Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
noting the fact in Stata that true or false expressions evaluate to 1 or 0 directly.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).
I often find myself needing to check whether or not variables are constant within a group. This is how I currently go about this (assume that the group is defined by a-b-c and the variable in question is var):
bys a b c (var): gen isconstant=var[1]==var[_N]
*manually inspect the results of the below tabulation; if all 1's, then it is constant
tab isconstant
drop isconstant
(Note that the above approach assumes that there are no missing observations within a group. I would have to think more about how to approach it if there were missings. And instead of manually checking, could use something along the lines of assert.)
This works fine, but is there a more succinct way to do this? Perhaps a one line solution, roughly analogous to isid ..., but of course checking for something else.
The principle behind your approach is also explained in this FAQ but I am not aware of a dedicated command. Still, it is programmable and you are a programmer, so where is yours?
Here is a quick stab:
*! 1.0.0 NJC 2 March 2020
program homog, sortpreserve
version 8
syntax varname [if] [in] [, MISSing BY(varlist) ]
* missings are ignored by default
if "`missing'" == "" {
marksample touse, strok
if "`by'" != "" markout `touse' `by', strok
}
else marksample touse, novarlist
tempvar OK
bysort `touse' `by' (`varlist') : gen byte `OK' = `varlist'[1] == `varlist'[_N]
quietly summarize `OK' if `touse'
if r(min) == 0 display as err "assertion is false"
end
and some silly examples:
. sysuse auto, clear
(1978 Automobile Data)
. homog mpg
assertion is false
. homog rep78, by(rep78)
. gen one = 1
. homog one
. replace one = . in L
(1 real change made, 1 to missing)
. homog one
. homog one, missing
assertion is false
So, the principles are
No news is good news. The only possible output, other than error messages, is a message "assertion is false". This isn't treated as an error. If your taste runs otherwise, clone the program, rename it and change the way it works.
by() is an option and if specified causes all comparisons to be by the distinct groups of observations so identified.
Missings are ignored by default. The option missing changes that so that for example 42 and missing are reported as different. This applies also to missing values of any by() variables.
This question has been edited to add sample data and clean-up (hopefully) some unnecessary steps per feedback.
I am starting with longitudinal data in wide format. I need to subset, reshape, and perform summary steps for multiple different chunks of data. I want to create macro variables with varlists needed for reshaping and other repetitive steps in wide and long format. The variables being reshaped follow a consistent naming pattern of (prefix)_(name)_#. There are also variables following the same pattern that do not need to be reshaped, and variables that are time-invariant and follow other naming conventions. To generate sample data:
set obs 1
foreach t in 0 6 15 18 21 {
foreach w in score postint postintc constime starttime {
gen p_`w'_`t' = 1
}
}
gen p_miles_0 = 1
gen p_hea_0 = 1
gen cons_age = 1
ds
I want to create two macro vars 1) wide_varlist for wide format data where the variables end in a number and 2) uniquestubs for long format data where the macro list contains just the stubs. I am having trouble using the macro list extended function "uniq" to generate #2 here. Here is my code so far. My full varlists are actually much longer.
Steps to create macro with wide format varlist:
/* create varlist for wide format data a time point 0,6,15,18,21 */
ds p_score_* p_postint_* p_postintc_* p_constime_* p_starttime_*
di "`r(varlist)'"
global wide_varlist `r(varlist)'
Start steps to create macro with long format varlist:
/*copy in wide format varlist*/
global stubs "$wide_varlist"
/*remove # - this results in a macro with 5 dups of same stub*/
foreach mo of numlist 0,6,15,18,21{
global stubs : subinstr global stubs "`mo'" "", all
}
/*keep unique stubs*/
global uniquestubs : list uniq stubs
Everything above works as I intend until global uniquestubs : list uniq stubs, which doesn't create the macro uniquestubs at all.
My situation seems similar to this this question but the same solution didn't work for me.
Any thoughts? Appreciate the help.
It's a bit difficult to follow what you are trying to do (a) without a reproducible example (b) because much of your code is just copying the same varlist to different places, which is a distraction.
We can fix (a) by creating a toy dataset:
clear
set obs 1
foreach t in 0 6 15 18 21 {
foreach w in score postint postintc constime starttime {
gen p_`w'_`t' = 1
}
}
ds
p_score_0 p_score_6 p_score_15 p_score_18 p_score_21
p_postint_0 p_postint_6 p_postint_15 p_postint_18 p_postint_21
p_postintc_0 p_postintc_6 p_postintc~5 p_postintc~8 p_postintc~1
p_constime_0 p_constime_6 p_constim~15 p_constim~18 p_constim~21
p_starttim~0 p_starttim~6 p_startti~15 p_startti~18 p_startti~21
Now the main difficulty seems to be that you want stubs for a reshape long. This code suffices for the toy dataset. There is no need to scan yet more variable names with the same information. If you don't have all variables for all time points, you may need more complicated code.
unab stubs: p_*_0
local stubs : subinstr local stubs "0" "", all
di "`stubs'"
p_score_ p_postint_ p_postintc_ p_constime_ p_starttime_
I don't understand the enthusiasm for globals here, but, programming taste aside, you can put the last result in a global quite easily.
I have two large datasets (more than 1000 variables in each), one of which has all the variables of the second, plus additional variables. I would like to get a list of all these additional variables, and then drop them and append one dataset to another. I have tried the command dta_equal, but got the same problem found here: http://www.stata.com/statalist/archive/2011-08/msg00308.html
I guess append, keep() cannot realize what I want to do directly, i.e., cannot append dataset while drop additional variables since I have to manually type in variables one by one in the keep() option, which is not realistic given my large dataset.
Are there any ways to deal with this?
There are several Stata commands that can be useful here.
The unab command is used in the first example to make a list of variable in the dataset with fewer variables. The second and third example use the describe command to obtain the list of variables in a dataset not currently in memory.
The final part the the example shows how to use extended macro list functions to obtain a list of common variables and the set of variables not common to both datasets.
* simulate 2 datasets, one has more variables than the other
sysuse auto, clear
save "data1.dta", replace
gen x = _n
gen y = -_n
save "data2.dta", replace
* example 1: drop after append
use "data1.dta", clear
unab vcommon : *
gen source = 1
append using "data2.dta"
replace source = 2 if mi(source)
keep `vcommon' source
* example 2: drop first then append
clear
describe using "data1.dta", varlist short
local vcommon `r(varlist)'
use `vcommon' using "data2.dta", clear
gen source = 2
append using "data1.dta"
replace source = 1 if mi(source)
* example 3: append and keep on the fly
use "data1.dta", clear
unab vcommon : *
gen source = 1
append using "data2.dta", keep(`vcommon')
replace source = 2 if mi(source)
* use extended macro list functions to manipulate variable list
clear
describe using "data1.dta", varlist short
local vlist1 `r(varlist)'
describe using "data2.dta", varlist short
local vlist2 `r(varlist)'
local vcommon : list vlist1 & vlist2
local vinonly1 : list vlist1 - vlist2
local vinonly2 : list vlist2 - vlist1
dis "common variables = `vcommon'"
dis "variables in data1 not found in data2 = `vinonly1'"
dis "variables in data2 not found in data1 = `vinonly2'"
I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat village
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then try to turn this macro names into variable
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.
It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
Some notes on your original post
Your loop doesn't do what you intend (although again, not crystal clear to me) because you are supplying a list (with one element: the text name). You can see it running and comparing:
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, If you have not read the initial chapters of the Stata User's Guide, then you must do so.