Is there a way to extract year range from wide data? - stata

I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.

Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
This relies on the fact that in Stata true-or-false expressions evaluate to 1 or 0 directly.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).
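For the other part of the question, getting the years into a local, a minimal sketch follows. It assumes the stub is exactly Car followed by four characters; the toy data and the local names carvars and years are illustrative assumptions, not from the original post. unab expands the wildcard and stops with an error if nothing matches, and since Stata variable names are case-sensitive, car2008 would not be picked up by Car????.

```stata
* Toy data standing in for one of the wide datasets (assumed values)
clear
set obs 3
gen Car2005 = 1
gen Car2006 = .
gen Car2008 = 2 in 1

* Expand the wildcard; unab exits with an error if no variable matches
unab carvars : Car????

* Collect the four characters after the stub "Car" into a local
local years
foreach v of local carvars {
    local y = substr("`v'", 4, 4)
    local years `years' `y'
}
display "`years'"

* Use the collected years in a plain loop, with no capture needed
foreach y of local years {
    gen HadCar`y' = !missing(Car`y')
}
```

Note that Car???? matches any four characters, not only digits, so a variable such as CarType would also be caught.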

Related

giving a string variable values conditional on another variable

I am using Stata 14. I have US states and their corresponding regions stored as integers.
I want to create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
...so the code is really long.
I was wondering if there is a shorter way to do this where I can automate assigning values rather than manually hard coding them.
You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence was defined in a separate dataset you could use merge. See e.g. this FAQ
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start
gen div_name = "New England" if div_no == 1
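The label define and decode route mentioned above can be sketched as follows. The toy data and the label name divlbl are assumptions for illustration, and only the three division names quoted in the question are filled in; the remaining ones would be added to the same label define line.

```stata
* Toy data with three of the nine division codes (assumed values)
clear
input byte div_no
1
2
9
end

* Define all code-to-name pairs in one statement
label define divlbl 1 "New England" 2 "Middle Atlantic" 9 "Pacific"

* Attach the labels, then turn them into a string variable
label values div_no divlbl
decode div_no, generate(div_name)
list
```

A side benefit is that div_no itself now displays with readable labels, so the string copy may not be needed at all.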

Is it possible to invoke a global macro inside a function in Stata?

I have a set of variables the list of which I have saved in a global macro so that I can use them in a function
global inlist_cond "amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss"
The reason why they are saved in a macro is because the list will be in a loop and its content will change depending on the year.
What I need to do is to generate a dummy variable so that water_dummy == 1 if any of the variables in the macro list has the WATER classification. In Stata, I need to write
gen water_dummy = inlist("WATER", "$inlist_cond")
, which--ideally--should translate to
gen water_dummy = inlist("WATER", amz2002ras_clss, amz2003ras_clss, amz2004ras_clss, amz2005ras_clss, amz2006ras_clss, amz2007ras_clss, amz2008ras_clss, amz2009ras_clss, amz2010ras_clss, amz2011ras_clss)
But this did not work---the code executed without any errors but the dummy variable only contained 0s. I know that it is possible to invoke macros inside functions in Stata, but I have never tried it when the macro contains a whole list of conditions. Any thoughts?
The double quotes in the generate statement make the second argument a literal string, so you are comparing text with text and the data are never consulted at all.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen a = "water"
. gen b = "wine"
. gen c = "beer"
. global myvars "a,b,c"
. gen found1 = inlist("water", "$myvars")
. gen found2 = inlist("water", $myvars)
. list
+---------------------------------------+
| a b c found1 found2 |
|---------------------------------------|
1. | water wine beer 0 1 |
+---------------------------------------+
The first comparison is equivalent to
. di inlist("water", "a,b,c")
0
which finds no match, as "water" is not matched by the (single!) other argument.
Macro references are certainly allowed within function or command calls: as each macro name is replaced by its contents before the syntax is checked, the function or command never even knows that a macro reference was ever used.
As @Aspen Chen concisely points out, omitting the double quotes gives what you want so long as the inlist() syntax remains legal.
If your data structure is something like in the following example, you can try the egen function incss, from egenmore (ssc install egenmore):
clear
set more off
input ///
str15(amz2009 amz2010)
"water" "juice"
"milk" "water"
"lemonade" "wine"
"water & beer" "tea"
end
list
egen watindic = incss(amz*), sub(water)
list
Be aware it searches for substrings (see the result for the last example observation).
A solution with a loop achieving different results is:
gen watindic2 = 0
forvalues i = 2009/2010 {
replace watindic2 = 1 if amz`i' == "water"
}
list
Another solution involves reshape, but I'll leave it at that.

Using Stata's keep command on multiple blocks of variables

I just started working on a massive dataset with 5 million observations and lots and lots of variables. To process this faster, I want to select only some variables of interest and drop the rest.
With keep, I could select a block of variables very simply:
keep varx1-x5
However, the variables I want are not in order in the dataset:
varx1 varx2 varx3 varz1 varz2 vary1 vary2 vary3
Where I don't want the varz variables. I want only the blocks with varx and vary.
So. I'm not very good at loops, but I tried this:
foreach varname of varlist varx1-varx3 vary1-vary3 {
keep `varname'
}
This doesn't work, because it keeps only varx1, then tries to keep the others, and errors out because they have just been dropped.
How can I tell keep to select multiple blocks of variables?
Rather than using keep which will wipe out variables not given to the command, try drop, which will delete only those you specify. The loop is not necessary. An example:
clear
set obs 0
*----- example vars -----
gen varx1 = .
gen varx2 = .
gen varx3 = .
gen varz1 = .
gen varz2 = .
gen vary1 = .
gen vary2 = .
gen vary3 = .
*----- what you want -----
drop varz*
Both commands are documented jointly, so help keep or help drop would have gotten you there.
If you don't know all the variables you want to drop, you can instead keep only the blocks with varx and vary:
keep varx* vary*
In a Stata varlist, * matches zero or more characters, so varx* picks up every variable whose name begins with varx.
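keep also accepts several ranges in a single varlist, which answers the question as literally asked. A minimal sketch, using the variable names from the question on an otherwise empty toy dataset:

```stata
* Recreate the variable order described in the question
clear
set obs 1
foreach v in varx1 varx2 varx3 varz1 varz2 vary1 vary2 vary3 {
    gen `v' = .
}

* One keep call with two ranges; no loop is needed
keep varx1-varx3 vary1-vary3
describe, simple
```

Ranges depend on the order of variables in the dataset, so the wildcard form is safer when that order is not guaranteed.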

stata - variable operations conditional to existent vars and to a list of varnames

I have this problem.
My dataset has variables like:
sec20_var1 sec22_var1 sec30_var1
sec20_var2 sec22_var2 sec30_var2 sec31_var2
(~102 sectors, ~60 variables; not all of the combinations are complete or even existent)
My intention is to build an indicator that averages variables within a sector. So it is an "aggregated sector" that contains sectors belonging to a class in a high-med-low technology fashion. I already have the definitions of which sectors should be included in each category. Let's say, in high technology I should put sec20 and sec31.
The problem: the list of sectors belonging to a class and the actual sectors available for each variable don't match. So I'm stuck with this problem and started to do it manually. My best approach was:
set more off
foreach v in _var02 {
ds *`v'
di "`r(varlist)'"
local sects`v' `r(varlist)'
foreach s in sec26 sec28 sec37 {
capture confirm local sects`v'
if !_rc {
egen oecd_medhigh_avg_`v'=rowmean(`s'`v' sec28`v' sec37`v' sec40`v' sec59`v' sec92`v' sec54`v' sec55`v' sec48`v' sec50`v' sec53`v' sec4`v' sec5`v' sec6`v')
else {
di "`v' didnt existed"
}
}
}
}
I got it to work only with those variables that have all the sectors present in the rowmean call (which is simpler since I don't have to store the varlist in a macro). I would like to average the AVAILABLE sectors, even if there are only two per variable.
I also noticed that storing the varlist in a macro could be helpful, but I don't know how to put it into my code. I'm totally stuck here.
Thanks for your help! :)
Thank you @SOConnell. As I said in my comment, I went in the same direction, but I'm still searching for the solution I expected (which I don't know how to program, or even whether it's possible).
I used this code, which goes in the same direction as the one written by @SOConnell, but I found this one clearer. The trick is the _rc==111 check, which catches the missing sector-by-variable combinations and creates them so that they can be used in the second part. Everything worked. It's not elegant, but it has some practical use. :) The third part erases the empty variables that were created.
*COMPLETING THE LIST OF COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 _... {
foreach s in sec27 sec35 sec42 sec43 sec45 sec46 sec39 sec52 sec67 {
capture confirm variable `s'`v'
if _rc==111 {
gen `s'`v'=.
}
}
}
*GENERATING THE INDICATOR WITH ALL POSSIBLE COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 ... {
egen oecd_high_avg_`v'=rowmean(sec27`v' sec35`v' sec42`v' sec43`v' sec45`v' sec46`v' sec39`v' sec52`v' sec67`v')
}
*DROPPING MISSING VARIABLES CREATED TO DO THE INDICATOR.
set more off
foreach v of varlist * {
gen TEMP=.
replace TEMP=1 if !missing(`v')
egen TEMPSUM=sum(TEMP)
if TEMPSUM==0 {
di " >>> Dropping empty variable: `v'"
drop `v'
}
drop TEMP TEMPSUM
}
Note that I have shortened the list of variables.
I will call what you are referring to as variables as "accounts".
The workaround would be to create empty variables in the dataset for all sectorXaccount combinations. From a point where you already have your dataset loaded into memory:
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
Then apply the rowmean operation to the full definition of each indicator. The missings won't be calculated into your rowmean, so it will effectively be an average of available cells without you having to do the selection manually. You could then probably automate deleting the empty variables you created if you do something like:
g start=.
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
g end=.
[indicator calculations go here]
drop start-end
However, it seems like you would be creating averages that might not be comparable (some will have 2 underlying values, some 3, some 4, etc.) so you need to be careful there (but you are probably already aware of that).
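An alternative that avoids creating and later dropping empty variables is to collect only the names that exist into a local and pass that local to rowmean. This is a minimal sketch on toy data, not the method used in either post above; the local name present and the values are assumptions for illustration.

```stata
* Toy data: sec42_var02 is deliberately absent (assumed values)
clear
set obs 2
gen sec27_var02 = 1
gen sec35_var02 = 3

local v _var02
local present

* Keep only the sector variables that actually exist for this account
foreach s in sec27 sec35 sec42 {
    capture confirm variable `s'`v', exact
    if !_rc local present `present' `s'`v'
}

* Average over whatever is available; skip if nothing was found
if "`present'" != "" {
    egen oecd_high_avg`v' = rowmean(`present')
}
list
```

Here capture is applied only to confirm, so any other error still stops the program. The same caveat about comparability applies, since each average may rest on a different number of sectors.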

Getting unknown function mean() in a forvalues loop

Getting unknown function mean for this. Can't use egen because it has to be calculated for each value. A little confused.
edu_mov_avg=.
forvalues current_year = 2/133 {
local current_mean = mean(higra) if longitbirthqtr >= current_year - 2 & longitbirthqtr >= current_year + 2
replace edu_mov_avg = current_mean if longitbirthqtr =
}
Your code is a long way from working. This should be closer.
gen edu_mov_avg = .
qui forvalues current_qtr = 2/133 {
su higra if inrange(longitbirthqtr, `current_qtr' - 2, `current_qtr' + 2), meanonly
replace edu_mov_avg = r(mean) if longitbirthqtr == `current_qtr'
}
You need to use a command generate to produce a new variable.
You need to reference local macro values with left and right single quotation marks, as in `current_qtr'.
egen has its own mean() function, but it produces a variable, whereas you need a constant here. Using summarize, meanonly is the most efficient method. There is in Stata no mean() function that can be applied anywhere. Once you use summarize, there is no need to use a local macro to hold its results. Here r(mean) can be used directly.
You have >= twice, but presumably don't mean that. Using inrange() is not essential in writing your condition, but gives shorter code.
You can't use if qualifiers to qualify assignment of local macros in the way you did. They make no sense to Stata, as such macros are constants.
longitbirthqtr looks like a quarterly date. Hence I didn't use the name current_year.
With a window this short, there is an alternative using time series operators
tsset longitbirthqtr
gen edu_mov_avg = (L2.higra + L1.higra + higra + F1.higra + F2.higra) / 5
That is not exactly equivalent as missings will be returned for the first two observations and the last two.
Your code may need further work if your data are panel data. But the time series operators approach remains easy so long as you declare the panel identifier, e.g.
tsset panelid longitbirthqtr
after which the generate call is the same as above.
All that said, rolling offers a framework for such calculations.
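To see what the summarize, meanonly loop produces at the ends of the series, here is a small worked example on assumed toy data (five quarters, with higra set to 10, 20, ..., 50). The variable names follow the question; the values are made up.

```stata
* Toy data: one observation per quarter (assumed values)
clear
set obs 5
gen longitbirthqtr = _n
gen higra = 10 * _n

gen edu_mov_avg = .
quietly forvalues q = 1/5 {
    * Mean of higra over quarters q-2 to q+2, using whatever exists
    summarize higra if inrange(longitbirthqtr, `q' - 2, `q' + 2), meanonly
    replace edu_mov_avg = r(mean) if longitbirthqtr == `q'
}
list
```

The results are 20, 25, 30, 35 and 40: the end quarters are averaged over fewer observations rather than set to missing, which is where this loop and the time series operator version differ.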