Optimal lag selection in Granger causality tests - Stata

I use [TS] varsoc to obtain the optimal lag length for the Granger causality test in Stata. This command reports the optimal number of lags according to several criteria, such as Akaike's information criterion (AIC).
Is there any way to store the optimal lag number (obtained based on AIC) in a variable and use it in the next command to estimate causality? Something like this:
Lag= varsoc X Y
tvgc X Y, p(Lag) d(Lag) trend window(30) prefix(_) graph

Here I adapt the first example in the help for varsoc. You can sort the matrix of statistics so that minimum AIC is in the first row, and read off the lag concerned.
. webuse lutkepohl2, clear
(Quarterly SA West German macro data, Bil DM, from Lutkepohl 1993 Table E.1)
. varsoc dln_inv dln_inc dln_consump
Lag-order selection criteria
Sample: 1961q2 thru 1982q4                                   Number of obs = 87

 +---------------------------------------------------------------------------+
 | Lag |    LL      LR      df    p     FPE       AIC      HQIC      SBIC    |
 |-----+---------------------------------------------------------------------|
 |   0 |  696.398                      2.4e-11  -15.9402  -15.9059  -15.8552*|
 |   1 |  711.682  30.568    9  0.000  2.1e-11  -16.0846  -15.9477* -15.7445 |
 |   2 |  724.696  26.028    9  0.002  1.9e-11* -16.1769* -15.9372  -15.5817 |
 |   3 |  729.124  8.8557    9  0.451  2.1e-11  -16.0718  -15.7294  -15.2215 |
 |   4 |  738.353  18.458*   9  0.030  2.1e-11  -16.0771   -15.632  -14.9717 |
 +---------------------------------------------------------------------------+
 * optimal lag
Endogenous: dln_inv dln_inc dln_consump
Exogenous: _cons
.
. mata
------------------------------------------------- mata (type end to exit) ---------------
: stats = st_matrix("r(stats)")
: _sort(stats, 7)
: st_numscalar("opt_lag_AIC", stats[1,1])
: end
-----------------------------------------------------------------------------------------
.
. di opt_lag_AIC
2
To plug into a later command automatically, use expressions like
`=opt_lag_AIC'
as arguments to options.
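For example, with opt_lag_AIC defined as above, the question's own call (tvgc is the community-contributed command from the question; this is a sketch, not verified output) becomes

tvgc X Y, p(`=opt_lag_AIC') d(`=opt_lag_AIC') trend window(30) prefix(_) graph

Stata evaluates `=opt_lag_AIC' before the command runs, so p() and d() receive the number 2 rather than a scalar name.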

How do I select the interaction coefficients to keep in Stata?

Similar to the question posed here, but I think I am not employing it correctly.
I used help fvvarlist to guide me on interactions.
I am employing a triple interaction with 3 binary variables:
As a toy model, let us assume:
x = gender (1 = male, 0 = female)
y = health (1 = good, 0 = poor)
z = employment (1 = employed, 0 = not employed)
using the following regression:
reg x##y##z if state == "NY" & year > 1985
I am interested in the results for 1.x#1.y#1.z, but this coefficient is omitted.
1.x#1.y#1.z omitted because of collinearity
Is there a way I can keep this interaction?
It would be best to verify that you actually have this combination in your data with egen, group.
You should also use i. prefixes to keep Stata from treating your variables as continuous, which has the added benefit of a more informative message: "identifies no observations in the sample" rather than a mysterious "omitted because of collinearity".
Here is a reproducible example:
. sysuse auto, clear
(1978 automobile data)
. sum mpg weight
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41
      weight |         74    3019.459    777.1936       1760       4840
. gen efficient = mpg > 21
. lab define efficient 0 "Inefficient" 1 "Efficient"
. lab val efficient efficient
. gen heavy = weight > 3e3
. lab define heavy 0 "Light" 1 "Heavy"
. lab val heavy heavy
. egen group = group(foreign efficient heavy), label(group)
. tab group, sort
       group(foreign efficient |
                        heavy) |      Freq.     Percent        Cum.
-------------------------------+-----------------------------------
    Domestic Inefficient Heavy |         34       45.95       45.95
       Foreign Efficient Light |         15       20.27       66.22
      Domestic Efficient Light |         13       17.57       83.78
     Foreign Inefficient Light |          5        6.76       90.54
      Domestic Efficient Heavy |          3        4.05       94.59
    Domestic Inefficient Light |          2        2.70       97.30
     Foreign Inefficient Heavy |          2        2.70      100.00
-------------------------------+-----------------------------------
                         Total |         74      100.00
. reg price c.foreign##c.efficient##c.heavy, robust
note: c.foreign#c.efficient#c.heavy omitted because of collinearity.
Linear regression                               Number of obs     =         74
                                                F(6, 67)          =      74.67
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2830
                                                Root MSE          =     2606.9

-----------------------------------------------------------------------------------------------
                              |               Robust
                        price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
------------------------------+----------------------------------------------------------------
                      foreign |     3007.6   960.0626     3.13   0.003     1091.307    4923.893
                    efficient |   513.2308   394.6504     1.30   0.198    -274.4948    1300.956
                              |
        c.foreign#c.efficient |  -1810.164   1071.875    -1.69   0.096    -3949.636    329.3076
                              |
                        heavy |   3283.176   696.5873     4.71   0.000     1892.782    4673.571
                              |
            c.foreign#c.heavy |   2462.724   1196.996     2.06   0.044     73.50896    4851.938
                              |
          c.efficient#c.heavy |  -2783.741   744.4813    -3.74   0.000    -4269.732    -1297.75
                              |
c.foreign#c.efficient#c.heavy |          0  (omitted)
                              |
                        _cons |       3739   332.9212    11.23   0.000     3074.486    4403.514
-----------------------------------------------------------------------------------------------
. reg price i.foreign##i.efficient##i.heavy, robust
note: 1.foreign#1.efficient#1.heavy identifies no observations in the sample.
Linear regression                               Number of obs     =         74
                                                F(6, 67)          =      74.67
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2830
                                                Root MSE          =     2606.9

------------------------------------------------------------------------------------------
                        |               Robust
                  price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
------------------------+----------------------------------------------------------------
                foreign |
                Foreign |     3007.6   960.0626     3.13   0.003     1091.307    4923.893
                        |
              efficient |
              Efficient |   513.2308   394.6504     1.30   0.198    -274.4948    1300.956
                        |
      foreign#efficient |
      Foreign#Efficient |  -1810.164   1071.875    -1.69   0.096    -3949.636    329.3076
                        |
                  heavy |
                  Heavy |   3283.176   696.5873     4.71   0.000     1892.782    4673.571
                        |
          foreign#heavy |
          Foreign#Heavy |   2462.724   1196.996     2.06   0.044     73.50896    4851.938
                        |
        efficient#heavy |
        Efficient#Heavy |  -2783.741   744.4813    -3.74   0.000    -4269.732    -1297.75
                        |
foreign#efficient#heavy |
Foreign#Efficient#Heavy |          0  (empty)
                        |
                  _cons |       3739   332.9212    11.23   0.000     3074.486    4403.514
------------------------------------------------------------------------------------------
There are no cars in the data that are foreign, efficient, and heavy all at once, and when you let Stata know that you have categorical variables on the RHS, you get an understandable message about why the triple interaction is missing.

Hamming distance for grouped results

I'm working with a data set that contains 40 different participants, each with 30 observations.
As I am observing search behavior, I want to calculate the search distance for each subject per round (from 1 to 30).
In order to compare my data with the current literature, I need to use the Hamming distance to describe search distances.
The variable is called Input and is a string variable holding binary digits 0 or 1 with a length of 10. E.g.:
Input Type 1 Subject 1 Round 1: 0000011111
Input Type 1 Subject 1 Round 2: 0000011110
Using the Levenshtein distance, my approach was simple:
sort type_num Subject round_num
gen input_prev=Input[_n-1]
replace input_prev="0000000000" if round_num==1 // default starting position of 0000000000, to get the search distance for the first input in round 1
// Levenshtein distance & clearing data (Levenshtein instead of Hamming distance)
ustrdist Input input_prev
rename strdist input_change
I am now struggling with getting the right Stata commands for the Hamming distance. Can someone help?
Does this help? As I understand it, Hamming distance is the count of characters (bits) that differ at corresponding positions of strings of equal length. So, given two variables and wanting comparisons within each observation, it is just a loop over the characters.
clear
set obs 10
set seed 2803

* sandbox: build two random 10-character strings of 0s and 1s
quietly forval j = 1/2 {
    gen test`j' = ""
    forval k = 1/10 {
        replace test`j' = test`j' + strofreal(runiform() > (`j' * 0.3))
    }
}

* add two benchmark observations: identical strings and completely different strings
set obs 12
replace test1 = 10 * "1" in 11
replace test2 = test1 in 11
replace test1 = test1[11] in 12
replace test2 = 10 * "0" in 12

* calculation: count the positions at which the two strings differ
gen wanted = 0
quietly forval k = 1/10 {
    replace wanted = wanted + (substr(test1, `k', 1) != substr(test2, `k', 1))
}
list
+----------------------------------+
| test1 test2 wanted |
|----------------------------------|
1. | 1110001111 1001101000 7 |
2. | 1111011011 1101011111 2 |
3. | 1011001111 1110110111 5 |
4. | 0000111011 1011010100 8 |
5. | 1011011011 1111100110 6 |
|----------------------------------|
6. | 0011111100 0100011110 5 |
7. | 0011011011 0011111010 2 |
8. | 1010100011 1011000100 5 |
9. | 1110011011 1010010100 5 |
10. | 1001011111 0100111001 6 |
|----------------------------------|
11. | 1111111111 1111111111 0 |
12. | 1111111111 0000000000 10 |
+----------------------------------+
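Transferred to the question's data, the same loop computes the Hamming distance directly (a sketch assuming the Input and input_prev variables built in the question, both length-10 strings of 0s and 1s; the variable name hamming is arbitrary):

* Hamming distance between Input and input_prev, counted position by position
gen hamming = 0
quietly forval k = 1/10 {
    replace hamming = hamming + (substr(Input, `k', 1) != substr(input_prev, `k', 1))
}

This would replace the ustrdist line: for equal-length strings, the Hamming distance is exactly this positional count.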

Chi-square test result difference when done manually and by SAS

I am trying to perform a chi-square test on my data using SAS University Edition.
Here is the structure of my data:
+----------+------------+-----------------+-------------------+
| study_id | Control_id | study_mortality | control_mortality |
+----------+------------+-----------------+-------------------+
| 1        | 50         | Alive           | Alive             |
| 1        | 52         | Alive           | Alive             |
| 2        | 65         | Dead            | Dead              |
| 2        | 70         | Dead            | Alive             |
+----------+------------+-----------------+-------------------+
I am getting different results when I do the test with SAS vs. when I do it manually using an online calculator. I used the values from PROC FREQ to calculate the chi-square statistic with the online calculator. Here are the outputs of the frequencies and the chi-square test. Can someone point out where the issue is?
proc freq data = mydata;
    tables study_mortality control_mortality;
    where type=1;
run;
+-----------------+-----------+
| study_mortality | Frequency |
+-----------------+-----------+
| Alive           |      7614 |
| Dead            |       324 |
+-----------------+-----------+

+-------------------+-----------+
| control_mortality | Frequency |
+-------------------+-----------+
| Alive             |      6922 |
| Dead              |       159 |
+-------------------+-----------+
proc freq data = mydata;
    tables study_mortality*control_mortality / CHISQ;
    where type=1;
run;
+-----------------+---------------------------+
|                 |     Control_mortality     |
| Study_mortality |   Alive |   Dead |  Total |
+-----------------+---------+--------+--------+
| Alive           |    5515 |    134 |   5649 |
| Dead            |     249 |      5 |    254 |
| Total           |    5764 |    139 |   5903 |
+-----------------+---------+--------+--------+
Statistic                     DF      Value      Prob
-----------------------------------------------------
Chi-Square                     1     0.1722    0.6782
Likelihood Ratio Chi-Square    1     0.1818    0.6699
Continuity Adj. Chi-Square     1     0.0414    0.8388
Mantel-Haenszel Chi-Square     1     0.1722    0.6782
Phi Coefficient                     -0.0054
Contingency Coefficient              0.0054
Cramer's V                          -0.0054
You have missing data. Look at the Ns on those tables.
Study mortality has about 8,000 records and control mortality about 7,000, but when you cross them you only have 5,903 records. This means that some records are excluded. There should be a line in the output saying how many observations are missing; either SAS did not print it or only selected output was pasted here. When I plug the crosstab counts into an online calculator, the p-value matches your SAS output exactly.
data have;
    infile cards;
    input Study Control N;
    cards;
1 1 5515
1 0 134
0 1 249
0 0 5
;
run;

proc freq data=have;
    tables study*control / chisq;
    weight N;
run;
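As a quick arithmetic check on that claim: the crosstab has row totals 5649 and 254, column totals 5764 and 139, and N = 5903, so the expected counts are about 5515.98, 133.02, 248.02, and 5.98. Each observed count differs from its expectation by about 0.98, and the Pearson statistic is roughly 0.98^2 * (1/5515.98 + 1/133.02 + 1/248.02 + 1/5.98) ≈ 0.172, reproducing the reported chi-square of 0.1722.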

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has the value 0, 1, or . in a number of variables (columns).
I would like to create two variables: one holding the count of all the 0s and one holding the count of all the 1s for each person (row).
In my case there is no pattern in the variable names, so I create a varlist of all the existing variables, excluding the ones that should not be counted.
+----+--------+----+----+----+----+----+---------+---------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1 |
+----+--------+----+----+----+----+----+---------+---------+
| 1  | A      | 1  | 1  | 1  | 1  | .  | 0       | 4       |
| 2  | B      | 0  | 0  | 0  | 1  | 1  | 3       | 2       |
| 3  | C      | 0  | 0  | .  | 0  | 0  | 4       | 0       |
| 4  | D      | 1  | 1  | 1  | 1  | 0  | 0       | 4       |
+----+--------+----+----+----+----+----+---------+---------+
The following works; however, I cannot add an if qualifier:
ds ID region, not // all variables in the dataset apart from ID and region
return list
local varlist `r(varlist)'
egen count_of_1s = rowtotal(`varlist')
If I change the last line to the one below, I get an invalid-syntax error:
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I turned from counting to summing because I thought this was a sneaky way out of the problem: I could recode the values from 0/1 to 1/2, sum each of the two values separately into two different variables, and then divide accordingly to get the actual count of 1s or 2s per row.
I found this: Stata: Using egen, anycount() when values vary for each observation. However, Stata freezes, as my dataset is quite large (100,000 rows and 3,000 columns).
Any help will be much appreciated :-)
Solution based on the response of William:
* number of valid responses (0s and 1s, excluding .) per row
ds ID region, not // all variables in the dataset apart from ID and region
return list
local varlist `r(varlist)' // copy rather than evaluate, so a long list is not truncated
egen count_of_nonmiss = rownonmiss(`varlist') // counts all the 0s and 1s (namely, the nonmissing values)
* total number of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must not be counted here!
return list
local varlist `r(varlist)'
egen count_of_1s = rowtotal(`varlist') // rowtotal is an egen function (so egen, not generate); with 0/1 values, the row sum equals the count of 1s
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if qualifier, the command expands to
egen count_of_1s = rowtotal(Qa Qb C3 C4 Wa) if Qa Qb C3 C4 Wa == 1
which is clearly a syntax error: an if qualifier expects a single logical expression, not a list of variables.
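If you do want the conditional logic spelled out, you can loop over the variables instead of using an if qualifier (a sketch reusing the varlist macro from the question; a missing value never equals 1, so it is not counted):

gen count_of_1s = 0
foreach v of local varlist {
    replace count_of_1s = count_of_1s + (`v' == 1)
}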
I had the same problem of counting the occurrences of specified values in each observation across a set of variables.
I resolved it as follows. If you want to count the occurrences of 0 across x1-x3:
clear
input id x1 x2 x3
id x1 x2 x3
1. 1 1 0 2
2. 2 2 0 2
3. 3 2 0 3
4. end
egen count2 = anycount(x1-x3), values(0)
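Applied to the original question, anycount gives both counts directly, since it ignores missing values (a sketch reusing the varlist built with ds earlier; note that the question reports anycount being slow on a 100,000 x 3,000 dataset):

ds ID region, not
local varlist `r(varlist)'
egen count_of_1s = anycount(`varlist'), values(1)
egen count_of_0s = anycount(`varlist'), values(0)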

Compute year on year growth of GDP for each quarter

I want to compute, in Stata, the year-on-year growth rate of GDP for each quarter. Basically, I want to compute: (gdp_q1y1980 - gdp_q1y1979) / gdp_q1y1979.
// create some silly example data
clear
set obs 10
gen time = _n
format time %tq
gen gdp = _n^2*100
// do the computation
tsset time
gen growth = S4.gdp / L4.gdp
// admire the result
list
I prefer @Maarten Buis's solution, but know also that you could use subscripting:
sort time
gen growth = (gdp / gdp[_n-4]) - 1
Run help subscripting for details (or http://www.stata.com/help.cgi?subscripting).
Note that extra care must be taken if, for example, there is a gap in your time series:
. clear all
. set more off
.
. // create some silly example data
. set obs 15
obs was 0, now 15
. gen time = _n
. format time %tq
. gen gdp = _n^2*100
.
. // create a gap deleting 1962q4
. drop in 11
(1 observation deleted)
.
. // using -tsset-
. tsset time
time variable: time, 1960q2 to 1963q4, but with a gap
delta: 1 quarter
. gen growth = S4.gdp / L4.gdp
(5 missing values generated)
.
. // subscripting
. sort time
. gen growth2 = (gdp / gdp[_n-4]) - 1
(4 missing values generated)
.
. list, separator(0)
+--------------------------------------+
| time gdp growth growth2 |
|--------------------------------------|
1. | 1960q2 100 . . |
2. | 1960q3 400 . . |
3. | 1960q4 900 . . |
4. | 1961q1 1600 . . |
5. | 1961q2 2500 24 24 |
6. | 1961q3 3600 8 8 |
7. | 1961q4 4900 4.444445 4.444445 |
8. | 1962q1 6400 3 3 |
9. | 1962q2 8100 2.24 2.24 |
10. | 1962q3 10000 1.777778 1.777778 |
11. | 1963q1 14400 1.25 1.938776 |
12. | 1963q2 16900 1.08642 1.640625 |
13. | 1963q3 19600 .96 1.419753 |
14. | 1963q4 22500 . 1.25 |
+--------------------------------------+
The results for the solution with subscripts (variable growth2) are messed up once the gap begins (1963q1). A good reason, I think, to prefer tsset.
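That said, if subscripts must be used with gapped data, the lag can at least be guarded so that no wrong values are produced (a sketch: the observation four rows back is accepted only when its quarterly value is exactly four quarters earlier):

sort time
gen growth3 = (gdp / gdp[_n-4]) - 1 if time == time[_n-4] + 4

This returns missing rather than a wrong number after the gap, although unlike tsset it cannot look past the gap: at 1963q1, S4.gdp / L4.gdp still finds 1962q1 four quarters back, while the guarded subscript version gives missing.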