Compute year on year growth of GDP for each quarter - stata

I want to compute the year-on-year growth rate of GDP for each quarter in Stata. That is, I want to compute (gdp_q1y1980 - gdp_q1y1979) / gdp_q1y1979.

// create some silly example data
clear
set obs 10
gen time = _n
format time %tq
gen gdp = _n^2*100
// do the computation
tsset time
gen growth = S4.gdp / L4.gdp
// admire the result
list
For more information on time-series operators such as L. and S., run help tsvarlist.
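As a small sketch of why S4.gdp / L4.gdp is the requested growth rate, the example below spells out the formula from the question and checks that both versions agree. The start date 1979q1 and the variable names growth_s and growth_long are assumptions made for illustration.

```stata
* illustrative data: 12 quarters starting in 1979q1 (assumed start date)
clear
set obs 12
gen time = tq(1979q1) + _n - 1
format time %tq
gen gdp = _n^2 * 100
tsset time

* S4.gdp is gdp - L4.gdp, so the two expressions below are the same ratio
gen growth_s    = S4.gdp / L4.gdp
gen growth_long = (gdp - L4.gdp) / L4.gdp

* both are missing for the first four quarters and agree afterwards
assert reldif(growth_s, growth_long) < 1e-6 if !missing(growth_s)
list, separator(4)
```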

I prefer @Maarten Buis's solution, but note that you could also use subscripting:
sort time
gen growth = (gdp / gdp[_n-4]) - 1
Run help subscripting for details (or http://www.stata.com/help.cgi?subscripting).
Note that extra care must be taken if, for example, there is a gap in your time series:
. clear all
. set more off
.
. // create some silly example data
. set obs 15
obs was 0, now 15
. gen time = _n
. format time %tq
. gen gdp = _n^2*100
.
. // create a gap deleting 1962q4
. drop in 11
(1 observation deleted)
.
. // using -tsset-
. tsset time
time variable: time, 1960q2 to 1963q4, but with a gap
delta: 1 quarter
. gen growth = S4.gdp / L4.gdp
(5 missing values generated)
.
. // subscripting
. sort time
. gen growth2 = (gdp / gdp[_n-4]) - 1
(4 missing values generated)
.
. list, separator(0)
+--------------------------------------+
| time gdp growth growth2 |
|--------------------------------------|
1. | 1960q2 100 . . |
2. | 1960q3 400 . . |
3. | 1960q4 900 . . |
4. | 1961q1 1600 . . |
5. | 1961q2 2500 24 24 |
6. | 1961q3 3600 8 8 |
7. | 1961q4 4900 4.444445 4.444445 |
8. | 1962q1 6400 3 3 |
9. | 1962q2 8100 2.24 2.24 |
10. | 1962q3 10000 1.777778 1.777778 |
11. | 1963q1 14400 1.25 1.938776 |
12. | 1963q2 16900 1.08642 1.640625 |
13. | 1963q3 19600 .96 1.419753 |
14. | 1963q4 22500 . 1.25 |
+--------------------------------------+
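The results for the solution with subscripts (variable growth2) are wrong once the gap begins (1963q1): gdp[_n-4] refers to the observation four rows earlier, which is no longer four quarters earlier. The time-series operators respect the calendar and return missing where the lagged quarter does not exist. A good reason, I think, to prefer tsset.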
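If subscripting is needed anyway, one possible workaround (a sketch, not part of the original answer) is to fill the gap first with tsfill, which adds an observation with missing values for each absent period. After that, four rows back is again four quarters back.

```stata
* same silly example data with a gap at 1962q4
clear
set obs 15
gen time = _n
format time %tq
gen gdp = _n^2*100
drop in 11

tsset time
tsfill                      // adds 1962q4 back with gdp missing
sort time

gen growth  = S4.gdp / L4.gdp
gen growth2 = (gdp / gdp[_n-4]) - 1

* the two variables should now agree, including where they are missing
list, separator(0)
```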

Related

Optimal lag selection in Granger Causality tests

I use [TS] varsoc to obtain the optimum lag length for the Granger causality test in Stata. This command reports the optimal number of lags based on different criteria such as Akaike's information criterion (AIC).
Is there any way to store the optimal lag number (obtained based on AIC) in a variable and use it in the next command to estimate causality? Something like this:
Lag= varsoc X Y
tvgc X Y, p(Lag) d(Lag) trend window(30) prefix(_) graph
Here I adapt the first example in the help for varsoc. You can sort the matrix of statistics so that minimum AIC is in the first row, and read off the lag concerned.
. webuse lutkepohl2, clear
(Quarterly SA West German macro data, Bil DM, from Lutkepohl 1993 Table E.1)
. varsoc dln_inv dln_inc dln_consump
Lag-order selection criteria
Sample: 1961q2 thru 1982q4 Number of obs = 87
+---------------------------------------------------------------------------+
| Lag | LL LR df p FPE AIC HQIC SBIC |
|-----+---------------------------------------------------------------------|
| 0 | 696.398 2.4e-11 -15.9402 -15.9059 -15.8552* |
| 1 | 711.682 30.568 9 0.000 2.1e-11 -16.0846 -15.9477* -15.7445 |
| 2 | 724.696 26.028 9 0.002 1.9e-11* -16.1769* -15.9372 -15.5817 |
| 3 | 729.124 8.8557 9 0.451 2.1e-11 -16.0718 -15.7294 -15.2215 |
| 4 | 738.353 18.458* 9 0.030 2.1e-11 -16.0771 -15.632 -14.9717 |
+---------------------------------------------------------------------------+
* optimal lag
Endogenous: dln_inv dln_inc dln_consump
Exogenous: _cons
.
. mata
------------------------------------------------- mata (type end to exit) ---------------
: stats = st_matrix("r(stats)")
: _sort(stats, 7)
: st_numscalar("opt_lag_AIC", stats[1,1])
: end
-----------------------------------------------------------------------------------------
.
. di opt_lag_AIC
2
To plug into a later command automatically, use expressions like
`=opt_lag_AIC'
as arguments to options.
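A minimal sketch of that last step, using the official commands var and vargranger rather than the user-written tvgc from the question. Whether tvgc accepts an expression inside p() and d() in the same way is an assumption that should be checked against its help file.

```stata
webuse lutkepohl2, clear
quietly varsoc dln_inv dln_inc dln_consump

* pick the lag with the smallest AIC (column 7 of r(stats))
mata
stats = st_matrix("r(stats)")
_sort(stats, 7)
st_numscalar("opt_lag_AIC", stats[1,1])
end

* `=opt_lag_AIC' is evaluated on the fly, so lags() receives 1/2 here
var dln_inv dln_inc dln_consump, lags(1/`=opt_lag_AIC')
vargranger
```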

DAX Cumulative Sum, reset by x period

I have a function for a cumulative sum as follows:
Actions closed cumulative =
IF(MAX(Dates[Date])<CALCULATE(MIN(Dates[Date]),
FILTER(all(Dates),Dates[PeriodFiscalYear]=[CurrentPeriod1])),
CALCULATE(Actions[Actions closed on time],
FILTER(ALL(Dates),
Dates[Date]<=max(Dates[Date])
)
))
Where CurrentPeriod1 is the period we're in, which returns something like this:
PeriodFiscalYear | Actions | Actions Closed Cumulative
P01-2018/19 | 4 | 608
P02-2018/19 | 19 | 627
P03-2018/19 | 17 | 644
P04-2018/19 | 6 | 650
P05-2018/19 | 7 | 657
So it's basically counting all the actions closed in the table at the moment but I'd like to reset on a certain number of periods, for example 3 periods would be:
PeriodFiscalYear | Actions | Actions Closed Cumulative
P12-2017/18 | 10 |
P13-2017/18 | 10 |
P01-2018/19 | 4 | 24
P02-2018/19 | 19 | 33
P03-2018/19 | 17 | 40
P04-2018/19 | 6 | 42
P05-2018/19 | 7 | 30
I'm struggling to understand how to do it, despite quite a lot of reading. I have a calendar table with dates by 13 periods per year and also pretty much every measure you could think of, month, monthyear, monthperiod etc etc. Any help would be appreciated. Ultimate goal is a moving average over a set number of periods.
Thanks
Assuming your Actions table uses date values and the periods are stored in the Dates table, I suggest first creating an index column for the periods so that they are much easier to work with:
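PeriodIndex = 13 * VALUE(MID(Dates[PeriodFiscalYear], 5, 4)) +
VALUE(MID(Dates[PeriodFiscalYear], 2, 2))
With 13 periods per fiscal year, these index values are consecutive integers: P13-2017/18 becomes 26234 and P01-2018/19 becomes 26235, for example. An index of the form 100 * year + period (201804 for P04-2018/19) is easier to read, but it jumps at each year boundary, so subtracting 3 from it would not step back three periods there.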
Now that you have the index column to work with, you can write a rolling cumulative sum like this:
Trailing3Periods =
VAR CurrentPeriod = MAX(Dates[PeriodIndex])
RETURN CALCULATE(SUM(Actions[Actions closed on time]),
FILTER(ALL(Dates),
Dates[PeriodIndex] <= CurrentPeriod &&
Dates[PeriodIndex] > CurrentPeriod - 3))

Chi-Sq test result difference when done Manually and by SAS

I am trying to perform a chi-square test on my data using SAS University Edition.
Here is the structure of my data:
+----------+------------+------------------+-------------------+
| study_id | Control_id | study_mortality | control_mortality |
+----------+------------+------------------+-------------------+
| 1 | 50 | Alive | Alive |
| 1 | 52 | Alive | Alive |
| 2 | 65 | Dead | Dead |
| 2 | 70 | Dead | Alive |
+----------+------------+------------------+-------------------+
I get different results when I run the test in SAS and when I do it manually with an online calculator. I used the values from PROC FREQ as input to the online calculator. Here are the frequency tables and the chi-square output. Can someone point out where the issue is?
proc freq data = mydata;
tables study_mortality control_mortality;
where type=1;
run;
+-----------------+-----------+
| study_mortality | Frequency |
+-----------------+-----------+
| Alive           | 7614      |
| Dead            | 324       |
+-----------------+-----------+
+-------------------+-----------+
| control_mortality | Frequency |
+-------------------+-----------+
| Alive             | 6922      |
| Dead              | 159       |
+-------------------+-----------+
proc freq data = mydata;
tables study_mortality*control_mortality/ CHISQ;
where type=1;
run;
+-----------------+-------------------+-------+
|                 | Control_mortality |       |
+-----------------+---------+---------+-------+
| Study_mortality | Alive   | Dead    | Total |
+-----------------+---------+---------+-------+
| Alive           | 5515    | 134     | 5649  |
| Dead            | 249     | 5       | 254   |
| Total           | 5764    | 139     | 5903  |
+-----------------+---------+---------+-------+
Statistic DF Value Prob
Chi-Square 1 0.1722 0.6782
Likelihood Ratio Chi-Square 1 0.1818 0.6699
Continuity Adj. Chi-Square 1 0.0414 0.8388
Mantel-Haenszel Chi-Square 1 0.1722 0.6782
Phi Coefficient -0.0054
Contingency Coefficient 0.0054
Cramer's V -0.0054
You have missing data. Look at the N's in those tables.
Study mortality has about 8,000 records (7,614 + 324 = 7,938) and control mortality about 7,000 (6,922 + 159 = 7,081), but the cross-tabulation contains only 5,903. Records with a missing value in either variable are therefore excluded from the two-way table. There should be a line in the output reporting the number of missing values; either SAS did not print it or you pasted only part of the output. When I enter the cross-tabulated counts into an online calculator, the p-value matches your SAS output exactly. The code below reproduces the test from those counts:
data have;
infile cards;
input Study Control N;
cards;
1 1 5515
1 0 134
0 1 249
0 0 5
;
run;
proc freq data=have;
table study*control / chisq;
weight N;
run;
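As a cross-check outside SAS, and only as a sketch, the same 2x2 counts can be entered into Stata's immediate command tabi. The cell counts are taken from the cross-tabulation in the question.

```stata
* rows: study Alive, Dead; columns: control Alive, Dead
tabi 5515 134 \ 249 5, chi2
```

This should reproduce the Pearson chi-square of about 0.172 with 1 degree of freedom that SAS reports.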

Convert one to many with 2 digits

I am currently handling a data set in Stata generated through ODK, the open data kit.
There is an option to answer questions with multiple answers. For example, my questionnaire asks "Which of these assets do you own?" and the interviewer tags all applicable answers out of 20 options.
This generated for me a string variable with contents such as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
As this is difficult to analyse for several hundred participants, I wanted to generate new variables creating a 1 or 0 for each of the answer options.
For the variable hou_as I tried to generate the variables hou_as_1, hou_as_2 etc. with the following code:
foreach p in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 {
local P : subinstr local p "-" ""
gen byte hou_as_`P' = strpos(hou_as, "`p'") > 0
}
For the single digits this brings the problem that the variable hou_as_1 is also filled with a 1 if any of the 10 11 12 ... 19 is filled even if the option 1 was not chosen. Similarly hou_as_2 is filled when the option 2, 12 or 20 is checked.
How can I avoid this issue?
You want 20 indicator or dummy variables. Note first that it's much easier to use forval to loop 1(1)20, e.g.
forval j = 1/20 {
gen hou_as_`j' = 0
}
initialises 20 such variables as 0.
I think it's easier to loop over the words of your answer variables, words being here just whatever is separated by spaces. There are at most 20 words, and it is a little crude but likely to be fast enough to go
forval j = 1/20 {
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Let's put that together and try it out on your example:
clear
input str42 hou_as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
end
forval j = 1/20 {
gen hou_as_`j' = 0
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Just to show that it worked:
. list in 3
+----------------------------------------------------------------------------+
3. | hou_as | hou_as_1 | hou_as_2 | hou_as_3 | hou_as_4 | hou_as_5 | hou_as_6 |
| 1 3 9 11 | 1 | 0 | 1 | 0 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_as_7 | hou_as_8 | hou_as_9 | hou_a~10 | hou_a~11 | hou_a~12 | hou_a~13 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_a~14 | hou_a~15 | hou_a~16 | hou_a~17 | hou_a~18 | hou_a~19 | hou_a~20 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------------------------------------------+
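An alternative sketch that keeps the strpos() idea from the question: pad both the answer string and the search term with spaces, so that " 1 " cannot match inside " 11 ". This assumes the answers are separated by single spaces, as in the examples shown.

```stata
clear
input str42 hou_as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
end

forval j = 1/20 {
    * search for " j " in " answers " so only whole words match
    gen byte hou_as_`j' = strpos(" " + hou_as + " ", " `j' ") > 0
}

list hou_as hou_as_1 hou_as_2 hou_as_11 hou_as_20
```

This needs only one pass per option instead of a double loop, at the cost of relying on the separator being exactly one space.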
Incidentally, your line
local P : subinstr local p "-" ""
does nothing useful. The local macro p only ever has contents which are integer digits, so there is no punctuation at all to remove.
See also this explanation and
. search multiple responses, sj
Search of official help files, FAQs, Examples, SJs, and STBs
SJ-5-1 st0082 . . . . . . . . . . . . . . . Tabulation of multiple responses
(help _mrsvmat, mrgraph, mrtab if installed) . . . . . . . . B. Jann
Q1/05 SJ 5(1):92--122
introduces new commands for the computation of one- and
two-way tables of multiple responses
SJ-3-1 pr0008 Speaking Stata: On structure & shape: the case of mult. resp.
. . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox & U. Kohler
Q1/03 SJ 3(1):81--99 (no commands)
discussion of data manipulations for multiple response data

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables. One that includes the count of all the 0 and one that has the count of all the 1 for each person (row).
In my case, there is no pattern in the variable names. For this reason I create a varlist of all the existing variables excluding the ones that need not to be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 1 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works, however, I cannot add an if statement
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I change the last line with the one below, I get an error of invalid syntax.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I turned from count to summing because I thought this is a sneaky way out of the problem. I could change the values from 0,1 to 1, 2, then sum all the two values separately in two different variables and then divide accordingly in order to get the actual count of 1 or 2 per row.
I found this Stata: Using egen, anycount() when values vary for each observation however Stata freezes as my dataset is quite large (100.000 rows and 3000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* number of total valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the non missing values)
* total numbers of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must be excluded from this list!
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist') // rowtotal() is an egen function, so egen rather than generate
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(`varlist') if Qa Qb C3 C4 Wa == 1
Clearly a syntax error.
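For a large dataset, a plain loop over the variables is another option. The sketch below uses the small example from the question; the variable names count_of_0s and count_of_1s follow the thread, and the local macro name vars is an assumption for illustration.

```stata
clear
input ID str1 region Qa Qb C3 C4 Wa
1 "A" 1 1 1 1 .
2 "B" 0 0 0 1 1
3 "C" 0 0 . 0 0
4 "D" 1 1 1 1 0
end

* collect the variable list before creating the count variables
ds ID region, not
local vars `r(varlist)'

gen count_of_0s = 0
gen count_of_1s = 0
foreach v of local vars {
    * (`v' == 0) is 1 when true and 0 otherwise; missing values add nothing
    quietly replace count_of_0s = count_of_0s + (`v' == 0)
    quietly replace count_of_1s = count_of_1s + (`v' == 1)
}

list
```

Each pass touches one variable only, so memory use stays flat even with thousands of columns.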
I had the same problem: counting the occurrences of specified values in each observation across a set of variables.
I resolved it as follows. If you want to count the occurrences of 0 across x1-x3:
clear
input id x1 x2 x3
1 1 0 2
2 2 0 2
3 2 0 3
end
egen count2 = anycount(x1-x3), value(0)