In Python, one can loop over a list of values and access their respective indexes using enumerate. For example:
values = ['var2006', 'var2007', 'var2008', 'var2009', 'var2010', 'var2011']
for index, value in enumerate(values, start=1):
    print(value.replace(value[-4:], str(index)))
Which returns:
var1
var2
var3
var4
var5
var6
I would like to do something similar in Stata. Specifically, I have a list of variables like 'var2006','var2007','var2008','var2009','var2010','var2011' and I would like to rename them to 'var1','var2','var3','var4','var5','var6'. I'm trying to use a combination of foreach and rename, but if there is something similar to Python's enumerate, that would solve this for me.
You can accomplish this by using the rename command with the renumber and sort options:
rename var# var#, renumber sort
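For instance, here is a quick sketch that builds the example variables from the question and applies that one-liner; the describe at the end should show var1 through var6:
* sketch: recreate the example variables, then rename in one line
clear
set obs 1
foreach y of numlist 2006/2011 {
    gen var`y' = `y'
}
rename var# var#, renumber sort
describe, simple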
Alternatively, if you want to refer to each variable individually, you could use a forvalues loop and a local macro. This would be more similar to Python's enumerate, as you could refer to one variable at a time:
clear
set obs 100
gen var2006 = 1
gen var2007 = 2
gen var2008 = 3
gen var2009 = 4
gen var2010 = 5
gen var2011 = 6
forvalues i = 2006/2011 {
    local j = `i' - 2005
    rename var`i' var`j'
}
Note that the local macro j is equal to the year minus 2005, and you could set this to be any number to change the names of the variables.
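If the variable names do not follow a convenient numeric pattern, a counter incremented inside a foreach loop gives an enumerate-like pattern. This is only a sketch, assuming the old names all match the wildcard var20*; unab just expands the wildcard into a variable list:
* enumerate-like pattern: i plays the role of Python's index
unab myvars : var20*
local i = 0
foreach v of local myvars {
    local ++i
    rename `v' var`i'
}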
In Stata, I am trying to change the values--both string and numeric--of one row based on the one just above or just below it only if the values are missing. Here are some sample data:
input str40 id var1 var2 var3 var4 str40 var5_string str40 var6_string
"correctly-spelled" 10 20 . . "random text 1" ""
"misspelled" . . 30 40 "" "random text 2"
end
Essentially, I want my final dataset to look as follows:
input str40 id var1 var2 var3 var4 str40 var5_string str40 var6_string
"correctly-spelled" 10 20 30 40 "random text 1" "random text 2"
end
I need a row-specific solution (i.e. avoiding collapse), because my (wide) dataset has thousands of labeled variables, and I don't want to lose the labels due to collapse. Also, not all of the variables are numeric, and the naming conventions of the variables are not consistent. Accordingly, fixing the spelling of id with a simple replace, executing a collapse (firstnm) id var5_string var6_string (mean) var1 var2 var3 var4, by(id), or using var* for anything won't help. Basically, what happened was one person merged using the "correctly-spelled" id, the other person merged using the "misspelled" id, and I don't have any of the source files. Thanks!
If you can assume that the misspelled ID comes right after (or right before) the correctly spelled one, you can use _n±1 to get the previous or following value. For more information on system variables, see help _variables.
If you assume the correct one always comes first, then the second replace would be sufficient.
mi() is an abbreviation of the missing() function.
The second condition, & !mi(`var'[_n±1]), is just there to make sure that non-missing values don't get replaced by missing ones, should two valid (but different) IDs come up sequentially. Depending on your data, this further condition might not be necessary.
local list_of_vars var1 var2 var3 var4 var5_string var6_string
foreach var of local list_of_vars {
    replace `var' = `var'[_n-1] if mi(`var') & !mi(`var'[_n-1])
    replace `var' = `var'[_n+1] if mi(`var') & !mi(`var'[_n+1])
}
. list
+-------------------------------------------------------------------------------+
| id var1 var2 var3 var4 var5_string var6_string |
|-------------------------------------------------------------------------------|
1. | correctly-spelled 10 20 30 40 random text 1 random text 2 |
2. | misspelled 10 20 30 40 random text 1 random text 2 |
+-------------------------------------------------------------------------------+
Then just keep the correct ones. Hopefully you can identify them somehow.
// The following is just to be able to identify the correct IDs. Of course, you
// will have to adapt it so that it matches only the correctly-spelled IDs, or
// use some other way of identifying them :)
gen _ck_correct_id = (id=="correctly-spelled")
keep if _ck_correct_id==1
I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have 530,000 firm-year observations in my sample, approximately 267,000 treated observations and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
    foreach j in `b' {
        capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
        capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
        capture noisily replace idobs = _id if industry == `i' & year == `j'
        drop _treated _support _weight _id _n1 _nn
    }
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, the values of these 290,724 observations vary between 1 and 933, which is odd. The variable neighbor1 should give me the observation ID number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
    foreach j in `b' {
        capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
        capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
        capture noisily replace idobs = _id if married == `i' & year == `j'
        drop _treated _support _weight _id _n1 _nn
    }
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360, because there are 4360 observations in this dataset; it is just an ID number. That is not the case: several observations share ID number 1, 2, and so on.
Second, neighbor1 varies between 1 and 204 meaning that the matched observations have only ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, installed through the package ietoolkit (ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT, but if all you want is to match observations across two groups using nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, this creates one matchID variable for each subset; you will then have to combine these into a single matchID without conflicts across the data set (a sketch of one way to do that follows the loop below).
* use data set and keep only vars required for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hours
* Set seed for replicability. NEVER use the 123456 seed in production, randomize a new seed
set seed 123456
*Generate mock treatment
gen treat = round(runiform())
*generate vars to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
*Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
*Loop over each subgroup
foreach year of local years {
    foreach married_status of local married_statuses {

        *This command is similar to psmatch2, but a simplified version for
        *when you are not looking for the ATT.
        *This command is only about matching.
        iematch if married == `married_status' & year == `year', grp(treat) match(hours) seedok m1 maxmatch(1)

        *These variables list meta info about the match. See the helpfile for docs,
        *but this copies info from each subset in this loop to single vars for
        *the full data set. Then the loop-specific vars are dropped.
        replace matchResult = _matchResult if married == `married_status' & year == `year'
        replace matchDiff = _matchDiff if married == `married_status' & year == `year'
        replace matchCount = _matchCount if married == `married_status' & year == `year'
        drop _matchResult _matchDiff _matchCount

        *For each loop you will get a match ID restarting at 1 for each loop.
        *Therefore we save them in one var per loop and combine them afterwards.
        rename _matchID matchID_`married_status'_`year'
    }
}
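Here is a sketch (not part of the original answer) of one way to combine the per-subset match IDs into a single ID without conflicts: pair each within-subset ID with the name of the matchID_* variable it came from, and let egen group() assign a unique number to each pair.
* collect the within-subset ID and the subset it came from (each obs belongs to at most one subset)
gen subset_id = .
gen subset = ""
foreach v of varlist matchID_* {
    replace subset_id = `v' if !missing(`v')
    replace subset = "`v'" if !missing(`v')
}
* unique match ID across the whole data set; unmatched observations stay missing
egen matchID = group(subset subset_id)
drop subset subset_id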
I have three different variables in Stata, var1, var2, and var3.
I need to make a summary table of these three variables so that I have the observation number, mean, sd, min, max as the fields in the resulting summary table.
I am using the following code :
su var1 if restriction == 2
su var2 if restriction == 3
su var3 if restriction == 4
Since the summary table is created from variables that are applied with restrictions, I am unable to use :
su var1 var2 var3
I would be very grateful if anyone has any ideas on how to modify my code so that, instead of three lines of code, I can use one line to get a single table with all the stats I require, which I can then copy as a table into my Word document.
Nothing reproducible here without example data. Please study https://stackoverflow.com/help/mcve
But I would go
gen var1_2 = var1 if restriction == 2
gen var2_3 = var2 if restriction == 3
gen var3_4 = var3 if restriction == 4
summarize var1_2 var2_3 var3_4
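For example, here is a minimal sketch with made-up data (variable names taken from the question; the values and seed are arbitrary assumptions), just to show the idea end to end:
* sketch: fabricate example data, then build the restricted copies and summarize
clear
set obs 30
set seed 2803
gen restriction = 1 + ceil(3*runiform())   // values 2, 3, 4
gen var1 = rnormal()
gen var2 = rnormal()
gen var3 = rnormal()
gen var1_2 = var1 if restriction == 2
gen var2_3 = var2 if restriction == 3
gen var3_4 = var3 if restriction == 4
summarize var1_2 var2_3 var3_4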
I want to use the local command in Stata to store several variables that I afterwards want to export as two subsamples. I separate the dataset by the grouping variable grouping_var, which is either 0 or 1. I tried:
if grouping_var==0 local vars_0 var1 var2 var3 var4
preserve
keep `vars_0'
saveold "data1", replace
restore
if grouping_var==1 local vars_1 var1 var2 var3 var4
preserve
keep `vars_1'
saveold "data2", replace
restore
However, the output is not as I expected and the data is not divided into two subsamples: the first saved file includes the whole dataset. Is there anything wrong with how I use the if statement here?
There is a bit of confusion between the "if qualifier" and the "if command" here. The syntax if (condition) (command) is the "if command", and generally does not provide the desired behavior when written using observation-level logical conditions.
In short, Stata evaluates if (condition) for the first observation, which is why your entire data set is being kept/saved in the first block (i.e., in your current sort order, grouping_var[1] == 0). See http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for more information.
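Here is a small illustration of the pitfall, as a sketch using the auto dataset rather than your data:
sysuse auto, clear
if foreign == 0 {
    * this whole block runs because foreign is 0 in the FIRST observation,
    * regardless of the value of foreign in any other row
    display "this runs once, for the full dataset"
}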
Assuming you want to keep different variables in each case, something like the code below should work:
local vars_0 var1 var2 var3 var4
local vars_1 var5 var6 var7 var8
forvalues g = 0/1 {
    preserve
    keep if grouping_var == `g'
    keep `vars_`g''
    save data`g', replace
    restore
}
I have panel data (time: date, name: ticker). I want to create 10 lags for variables x and y. Now I create each lag variable one by one using the following code:
by ticker: gen lag1 = x[_n-1]
However, this looks messy.
Can anyone tell me how can I create lag variables more efficiently, please?
Shall I use a loop or does Stata have a more efficient way of handling this kind of problem?
@Robert has shown you the streamlined way of doing it. For completeness, here is the "traditional", boring way:
clear
set more off
*----- example data -----
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
list, sepby(id)
*----- what you want -----
// "traditional" loop
forvalues i = 1/10 {
    gen x_`i' = L`i'.x
    gen y_`i' = L`i'.y
}
list, sepby(id)
And a combination:
// a combination
foreach v in x y {
    tsrevar L(1/10).`v'
    rename (`r(varlist)') `v'_#, addnumber
}
If the purpose is to create lagged variables to use in some estimation, note that you can use time-series operators directly within many estimation commands; there is no need to create the lagged variables in the first place. See help tsvarlist.
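For example, with the example data set up above, a regression on ten lags of x can be written directly (just a sketch of the idea):
regress y L(1/10).x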
You can loop to do this but you can also take advantage of tsrevar to generate temporary lagged variables. If you need permanent variables, you can use rename group to rename them.
clear
set obs 2
gen id = _n
expand 20
bysort id: gen time = _n
tsset id time
set seed 12345
gen x = runiform()
gen y = 10 * runiform()
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber
tsrevar L(1/10).y
rename (`r(varlist)') y_#, addnumber
Note that if you are doing this to calculate a statistic on a rolling window, check out tsegen (from SSC).
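For instance, a rolling mean over the previous ten periods might look like the following sketch (the variable name x_ma10 is made up; tsegen must first be installed from SSC, and the data must be tsset as above):
* ssc install tsegen   // if not already installed
tsegen x_ma10 = rowmean(L(1/10).x)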