I want to calculate growth rates in Stata for observations having the same ID. My data looks like this in a simplified way:
ID year a b c d e f
10 2010 2 4 9 8 4 2
10 2011 3 5 4 6 5 4
220 2010 1 6 11 14 2 5
220 2011 6 2 12 10 5 4
334 2010 4 5 4 6 1 4
334 2011 5 5 4 4 3 2
Now I want to calculate, for each ID, growth rates for variables a-f from 2010 to 2011.
For example, for ID 10 and variable a it would be (3-2)/2; for variable b, (5-4)/4; and so on, with the results stored in new variables (e.g. growth_a, growth_b).
Since I have over 120k observations and around 300 variables, is there an efficient way to do so (loop)?
My code looks like the following (simplified):
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
FYI: variables a-f are numeric.
But Stata says 'local not found', and I am not sure whether the code is correct. Do I also have to sort by year first?
The specific error in
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
is an error in the syntax of foreach, which here expects syntax like foreach x of local variables, given your prior use of a local macro. With the keyword in, foreach takes the word local literally and here looks for a variable with that name: hence the error message. This is basic foreach syntax: see its help.
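For reference, here is a minimal sketch contrasting the two working variants (the display statements just echo each name):

```stata
local variables "a b c d e f"

* of local: foreach expands the macro and loops over its elements
foreach x of local variables {
    di "`x'"
}

* in: you list the elements themselves, literally
foreach x in a b c d e f {
    di "`x'"
}
```

Both loops print a through f; only the first goes through the macro.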
This code is problematic for further reasons.
Sorting on ID does not guarantee the correct sort order, here time order by year, for each distinct ID. If observations are jumbled within ID, results will be garbage.
The code assumes that all time values are present; otherwise the time gap between observations might be unequal.
A cleaner way to get growth rates is
tsset ID year
foreach x in a b c d e f {
gen `x'_gr = D.`x'/L.`x'
}
Once you have tsset (or xtset) the data, the time-series operators can be used without fear: correct sorting is automatic, and the operators are smart about gaps in the data (e.g. jumps from 1982 to 1984 in yearly data).
For more variables the loop could be
foreach x of var <whatever> {
gen `x'_gr = D.`x'/L.`x'
}
where <whatever> could be a general (numeric) varlist.
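To make the mechanics concrete, here is the toy example from the question run end to end (a sketch using only the posted values):

```stata
* toy data from the question
clear
input ID year a b c d e f
10  2010 2 4  9  8 4 2
10  2011 3 5  4  6 5 4
220 2010 1 6 11 14 2 5
220 2011 6 2 12 10 5 4
334 2010 4 5  4  6 1 4
334 2011 5 5  4  4 3 2
end

tsset ID year
foreach x of var a-f {
    gen `x'_gr = D.`x'/L.`x'
}

* for ID 10 in 2011: a_gr = (3 - 2)/2 = .5 and b_gr = (5 - 4)/4 = .25
list ID year a_gr b_gr if ID == 10
```

With some 300 variables there is no need to type the varlist by hand: ds ID year, not leaves the names of all the other variables in r(varlist), so the loop can open with foreach x of var `r(varlist)'.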
EDIT: The question has changed since first posting and interest is declared in calculating growth rates only from 2010 to 2011, with the implication in the example that only those years are present. The more general code above will naturally still work for calculating those growth rates.
My data is currently organized in Stata as follows:
input str2 Country gdp_2015 gdp_2016 gdp_2017 imports_2016 imports_2017 exports_2016 exports_2017
"A" 11 12 13 5 6 8 5
"B" 11 . . 5 6 10 5
"C" 12 13 . 5 6 8 5
end
gen net_imports = (imports_2017 - exports_2017)
gen net_imports_toGDP = (net_imports/gdp_2017)
The code works, but it only creates a value if a country has 2017 data. I would essentially like to create an import-to-GDP ratio based on the most recent GDP observation available for each country.
You could simply replace the missing data as follows:
replace gdp_2016 = gdp_2015 if mi(gdp_2016)
replace gdp_2017 = gdp_2016 if mi(gdp_2017)
However, a more general approach would begin by reshaping your data from wide to long:
reshape long gdp_ imports_ exports_, i(Country)
See help reshape for more detail on the command. The gdp_ etc. are the stubs that will be the new variable names, and i(Country) sets the identifier.
Then you can fill forward within each country using time-series operators:
encode Country, generate(Country_num)
xtset Country_num _j
replace gdp_ = L.gdp_ if mi(gdp_) & !mi(L.gdp_)
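Note that a single replace with the lag operator fills only one-period gaps: a country missing GDP in two consecutive years would still be missing in the second of them. A cascading replace with subscripts carries the last known value across gaps of any length, because replace works through the observations in order and each filled value is immediately visible to the next. A self-contained sketch (exports_2017 added so the rows balance):

```stata
clear
input str2 Country gdp_2015 gdp_2016 gdp_2017 imports_2016 imports_2017 exports_2016 exports_2017
"A" 11 12 13 5 6  8 5
"B" 11  .  . 5 6 10 5
"C" 12 13  . 5 6  8 5
end

reshape long gdp_ imports_ exports_, i(Country)
encode Country, generate(Country_num)

* within each country, in year (_j) order, copy the previous value into
* any missing; runs of missings are filled in a single pass
bysort Country_num (_j) : replace gdp_ = gdp_[_n-1] if mi(gdp_)
```

Country B then carries 11 forward into both 2016 and 2017.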
This is a problem that I have never encountered before, hence, I don't even know where to start.
I have an unbalanced panel data set (different products sold at different stores across weeks) and would like to run correlations on sales between each product combination. The requirement is, however, a correlation is only to be calculated using the sales values of two products appearing together in the same store and week. That is to say, some weeks or some stores may sell only either of the two given products, so we just want to disregard those instances.
The number of observations in my data set is 400,000, but among them I have only 50 products, so the final correlation matrix would be 50 x 50 = 2500 cells, with 50*49/2 = 1225 unique pairwise correlations. Does that make sense?
clear
input str2 product sales store week
A 10 1 1
B 20 1 1
C 23 1 1
A 10 2 1
B 30 2 1
C 30 2 1
F 43 2 1
end
The correlation table should be something like this (instead of correlation values, I put in square brackets the value pairs that would be used). Please note that I cannot run a correlation for A and F because they appear together in only one store/week combination.
A B C
A 1 [10,20; 10,30] [10,23; 10,30]
B 1 [20,23; 30,30]
C 1
You calculate correlations between pairs of variables; but what you regard as pairs of variables are not so in the present data layout. So, you need a reshape. The principle is shown by
clear
input str2 product sales store week
A 10 1 1
B 20 1 1
C 23 1 1
A 10 2 1
B 30 2 1
C 30 2 1
F 43 2 1
end
reshape wide sales , i(store week) j(product) string
rename sales* *
list
+----------------------------------+
| store week A B C F |
|----------------------------------|
1. | 1 1 10 20 23 . |
2. | 2 1 10 30 30 43 |
+----------------------------------+
pwcorr A-F
| A B C F
-------------+------------------------------------
A | .
B | . 1.0000
C | . 1.0000 1.0000
F | . . . .
The results look odd only because your toy example won't allow otherwise. A doesn't vary in your example, so its correlations aren't defined. The correlation between B and C is perfect because there are just two data points, differing in both B and C.
A different problem is that a 50 x 50 correlation matrix is unwieldy. How to get friendlier output depends on what you want to use it for.
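One way to get friendlier output is to loop over the product pairs and list each correlation once, together with the number of store-week combinations it rests on. A sketch, assuming the wide layout just created; pairs with fewer than two joint observations (such as A and F here) are skipped:

```stata
* all variables except the identifiers are product columns
ds store week, not
local prods `r(varlist)'

foreach a of local prods {
    foreach b of local prods {
        * the "<" comparison lists each unordered pair exactly once
        if "`a'" < "`b'" {
            capture quietly corr `a' `b'
            if _rc == 0 & r(N) >= 2 {
                di "`a' `b'   r = " %7.4f r(rho) "   N = " r(N)
            }
        }
    }
}
```

corr drops observations missing on either of the two variables listed, which here is exactly the "appearing together in the same store and week" requirement.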
I'm looking to build flags for students who have repeated a grade, skipped a grade, or who have an unusual grade progression (e.g. 4th grade in 2008 and 7th grade in 2009). My data is unique at the student id-year-subject level and structured like this (albeit with more variables):
id year subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
This is the code that I've used:
sort id year grade
gen repeat_flag = .
replace repeat_flag = 1 if year!=year[_n+1] & grade==grade[_n+1] ///
& subject!=subject[_n+1] & id==id[_n+1]
replace repeat_flag = 0 if repeat_flag==.
One problem is that there are a lot of students who took a test in, say, 6th grade, didn't take one in 7th, and then took one in 8th grade. This varies across years and school districts, as certain districts adopted tests in different years for different grade levels. My code doesn't account for this.
Regardless, I think there must be a more elegant way to do this, and as a side note I wanted to know whether the use of subscripts is appropriate for a problem like this. Thanks!
Edit
Included a sample of what my data looks like above in response to one of the comments below. If it's still not clear, any feedback is welcome.
What may seem anomalous are students progressing faster or more slowly in tested grade than the passage of time would imply. That's possibly just one line for the grunt work:
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
bysort id (year) : gen flag = (tested - tested[_n-1]) - (year - year[_n-1])
list if flag != 0 & flag < . , sepby(id)
+---------------------------------------+
| id year subject tested~e flag |
|---------------------------------------|
5. | 2 2012 r 7 2 |
+---------------------------------------+
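If you want explicit indicators rather than a listing, the sign of flag distinguishes the cases (a sketch; the new variable names are illustrative, and note that a skip and a repeat falling inside a gap in test years could cancel to flag == 0):

```stata
* flag < 0: tested grade advanced more slowly than the years did (repeat)
* flag > 0: tested grade advanced faster than the years did (skip)
gen byte repeated = flag < 0 if !mi(flag)
gen byte skipped  = flag > 0 if !mi(flag)
```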
I am having trouble using the L1. lag operator in Stata 14 to create lagged variables.
The resulting lag variable is 100% missing values!
gen d = L1.equity
Thanks in advance.
There is hardly enough information given in the question to know for certain, but as @Dimitriy V. Masterov suggested by questioning how your data is tsset, you likely have an issue there.
As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
Now, if you improperly tsset this data, you can easily generate the missing values you describe:
tsset year id
gen lag_gdp = L1.gdp
And notice now how you have 10 missing values generated. In this example, it happens because the panel and time variables are out of order and the (incorrectly specified) time variable has gaps (period 1 and period 3, but no period 2).
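By contrast, declaring the panel identifier first and the time variable second gives the intended result:

```stata
* correct order: panel id first, then time
tsset id year
gen lag_gdp2 = L1.gdp
* only the first year within each panel is missing: 2 missing values, not 10
```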
Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:
clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end
tsset year gdp
gen d = L1.gdp
I suspect you are having a similar issue.
Without knowing what your data looks like or how it is tsset there is no possible way to diagnose this, but it is very likely an issue with how the data is tsset.
I have the following data matrix containing ideology scores in a customized dataset:
year state cdnum party name dwnom1
1946 23 10 200 WOODRUFF 0.43
1946 23 11 200 BRADLEY F. 0.534
1946 23 11 200 POTTER C. 0.278
1946 23 12 200 BENNETT J. 0.189
My unit of analysis is a given congressional district in a given year. As one can see, state #23, cdnum #11 has two observations in 1946.
What I would like to do is delete the earlier observation, in this case the one corresponding to name BRADLEY F. This happens when a congressional district has two members in a given Congress. The code I have tried is as follows:
drop if year==[_n+1] & statenum==[_n+1] & cdnum==[_n+1]
My attempt is a conditional argument: drop the observation if the year is the same as in the next observation, the statenum is the same as in the next observation, and the cdnum is the same as in the next observation. In this way, I can ensure each district has only one observation for a given year. When I attempt to run the code I get:
drop if year==[_n-1] & statenum==[_n-1] & cdnum==[_n-1]
(0 observations deleted)
Brief alternative: You should check out the duplicates command.
Detailed explanation of error:
You don't mean what you say to Stata.
Your conditions such as
if year == [_n-1]
should be
if year == year[_n-1]
and so forth.
[_n-1]
by itself is treated as if you typed
_n-1
which is the observation number, minus 1.
Here is a dopey example. Read in the auto data.
. sysuse auto
(1978 Automobile Data)
. list foreign if foreign == [_n-1], nola
+---------+
| foreign |
|---------|
1. | 0 |
+---------+
The variable foreign is equal to _n - 1 precisely once, in observation 1 when foreign is 0 and _n is 1.
In short, [_n-1] is not to be interpreted as the previous value (of the variable I just mentioned).
help subscripting gives very basic help.
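Putting this together for your district data, one sketch keeps the last observation (in original order) within each year-state-cdnum group, which drops BRADLEY F. in your example. Variable names follow your data listing; duplicates report is worth running first to see what would go:

```stata
clear
input year state cdnum party str12 name dwnom1
1946 23 10 200 "WOODRUFF"   0.43
1946 23 11 200 "BRADLEY F." 0.534
1946 23 11 200 "POTTER C."  0.278
1946 23 12 200 "BENNETT J." 0.189
end

duplicates report year state cdnum

* record original order, then keep the last observation per district-year
gen long obsno = _n
bysort year state cdnum (obsno) : keep if _n == _N
drop obsno
```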