Linear regression - testing for multicollinearity in the presence of heteroskedasticity in EViews

My question concerns the use of the VIF test for multicollinearity diagnostics when the model suffers from heteroskedasticity.
I want to use HAC-corrected standard errors to account for heteroskedasticity in my model. However, VIF gives me starkly different results depending on whether I run it after estimating the model with plain OLS (no correction) or after estimating the regression with HAC applied. I use EViews.
This surprised me, because the VIF statistic is just 1/(1 - R_i^2), where R_i^2 is the R^2 from a regression of the given variable x_i on the rest of the X variables. This implies that the result should not depend on the standard errors of the estimated parameters in the original regression of y on X, and thus should not depend on whether I use robust errors or not.
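To make that concrete, here is a rough sketch of the classical calculation (in Python with statsmodels, since I cannot script EViews here); note that the y-on-X regression and its covariance estimator never enter it:
import numpy as np
import statsmodels.api as sm

def classical_vif(X):
    # X: (n, k) array of regressors, without the constant column
    vifs = []
    for i in range(X.shape[1]):
        # regress x_i on the remaining regressors plus a constant
        others = sm.add_constant(np.delete(X, i, axis=1))
        r2 = sm.OLS(X[:, i], others).fit().rsquared
        vifs.append(1.0 / (1.0 - r2))
    return vifs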
However, EViews calculates VIF differently, using the estimated standard errors of the parameters (tutorial, p. 198 of the PDF). While the two approaches are said to be equivalent, that is clearly not the case in my example.
In short, in which order should I proceed: first test for multicollinearity on the plain OLS model and then move to the model with HAC, or the other way around, estimate the model with HAC first and then run VIF?
Thanks for all your help!


Random Forest with more features than data points

I am trying to predict whether a particular service ticket raised by a client needs a code change.
I have training data.
I have around 17k data points, each with a problem description and a tag (Y for code change required, N for no code change).
I ran TF-IDF, which gave me 27k features, so I tried to fit a RandomForestClassifier (scikit-learn, Python) on this 17k x 27k matrix.
I am getting very low scores on the test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.ensemble.RandomForestClassifier(n_jobs=3, n_estimators=100, class_weight='balanced', max_features=None, oob_score=True)
Please help!
EDIT:
I have 11k training examples with 900 positives (skewed). I tried LinearSVC-based sparsification, but it didn't work, and neither did Truncated SVD (latent semantic indexing). max_features=None performs better on the test set than the default.
I have also tried SVM, logistic regression (L2 and L1), and ExtraTrees. RandomForest still works best.
Right now I am getting 92% precision on positives, but recall is only 3%.
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (character count, word count, their difference, their ratio, day of week the problem was reported, day of month, etc.) and now I am at 19-20% recall with >95% accuracy.
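For what it's worth, a rough sketch of that kind of hand-crafted feature (the column names description and reported_at are only illustrative, not my real schema):
import pandas as pd

def add_simple_features(df):
    # df is assumed to have a free-text column "description" and a
    # datetime column "reported_at"; both names are hypothetical
    out = df.copy()
    out["n_chars"] = out["description"].str.len()
    out["n_words"] = out["description"].str.split().str.len()
    out["char_word_diff"] = out["n_chars"] - out["n_words"]
    out["chars_per_word"] = out["n_chars"] / out["n_words"].clip(lower=1)
    out["day_of_week"] = out["reported_at"].dt.dayofweek
    out["day_of_month"] = out["reported_at"].dt.day
    return out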
Any thoughts on using averaged word2vec vectors as deep features for the free text, instead of TF-IDF or bag of words?
Random forest handles more features than data points just fine. RF is used, for example, in microarray studies with roughly a 100:5000 data-point-to-feature ratio, or in single-nucleotide polymorphism (SNP) studies with, e.g., a 5000:500,000 ratio.
I disagree with the diagnosis provided by ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay attention to non-cross-validated training-set prediction performance for an RF model, because any sample will end up in the terminal nodes/leaves it has itself defined. The overall ensemble model is still robust.
[edit] If you changed max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross-validated training error/precision of a random forest model, or of many other ensemble models, simply does not estimate anything useful.
[Before the edit I confused max_features with n_estimators; sorry, I mostly use R.]
Setting max_features=None is not random forest but rather 'bagged trees'. You may benefit from a somewhat lower max_features, which improves regularization and speed, or maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
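(For 27,000 features that range works out to max_features somewhere between roughly 164, i.e. sqrt(27000), and 9,000, i.e. 27000/3.)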
You may achieve better test-set prediction performance through feature selection. You can run one RF model, keep the top ~5-50% most important features, and then re-run the model with only those features. 'L1 lasso' variable selection, as ncfirth suggests, may also be a viable solution.
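A minimal sketch of the importance-based version in scikit-learn (X_train, y_train and X_test are assumed to already hold your TF-IDF features and labels; the threshold and settings are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# first fit on all features
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            max_features="sqrt", n_jobs=3)
rf.fit(X_train, y_train)
# keep features with above-median importance, i.e. roughly the top 50%
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_train_small = selector.transform(X_train)
X_test_small = selector.transform(X_test)
# refit on the reduced feature set
rf_small = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                  max_features="sqrt", n_jobs=3)
rf_small.fit(X_train_small, y_train)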
Your metric of prediction performance, precision, may not be optimal in the case of unbalanced data or when the costs of false negatives and false positives are quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with the i.i.d. assumptions that any supervised ML model relies on, or you may need to wrap the entire data processing in an outer cross-validation loop to avoid overly optimistic estimates of prediction performance due to, e.g., the variable-selection step.
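One way to do that outer loop is with a scikit-learn Pipeline, so the selection step is re-fitted inside every fold (a sketch; X and y are assumed to hold the full TF-IDF matrix and labels, and the settings are illustrative):
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    # variable selection happens inside each CV fold, not on the full data
    ("select", SelectFromModel(LinearSVC(C=0.05, penalty="l1", dual=False))),
    ("rf", RandomForestClassifier(n_estimators=100, class_weight="balanced")),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="precision")
print(scores.mean())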
It seems like you've overfit on your training set. Basically the model has learnt noise in the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that your model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting; however, that may still be the case (left as an exercise to the reader!). Either way, feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features is scikit-learn's feature selection documentation. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
This returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn, but sometimes called the sparsity parameter).
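For example, a quick loop over C (reusing the toy X and y from above) shows that smaller values of C keep fewer features:
for C in (0.001, 0.01, 0.05, 0.1, 1.0):
    svc = LinearSVC(C=C, penalty="l1", dual=False).fit(X, y)
    n_kept = SelectFromModel(svc, prefit=True).transform(X).shape[1]
    print(C, n_kept)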

Using predict for ancillary parameters in maximum likelihood model in Stata

I wanted to know whether I can use predict for ancillary parameters in a maximum likelihood program as follows (I estimated lnsigma, so sigma is the ancillary parameter in the model):
predict lnsigma, eq(lnsigma)
gen sigma=exp(lnsigma)
I would also like to know whether the above can be used for a heteroskedastic model.
Thank you in advance.
That sounds correct. I would be more explicit by typing predict lnsigma, xb eq(lnsigma). This way your code will not break when someone later decides to write a prediction program for your estimation program and sets the default to something other than the linear prediction.
You can also do it in one line:
predictnl sigma = exp(xb(#2))
This assumes that lnsigma is the second equation in your model. If it is the third equation, you replace xb(#2) with xb(#3). predictnl is also an easy way of using the delta method to get standard errors and confidence intervals for sigma.
I assume this is your own Stata program. If that is true, then you also have a third option: you can create your own prediction program, which Stata's predict command will recognize. You can find some useful tricks on how to do that here: http://www.stata.com/help.cgi?_pred_se

Block bootstrap with indicator variable for each block

I want to run block bootstrap, where the blocks are countries, and include country indicator variables. I thought the following would work.
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
But I get the following error.
. regress mvalue kstock i.country, vce(bootstrap, cluster(country))
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxx 50
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
It seems that this should work. If the block bootstrap picks the same country for every block, then it seems it should just drop the intercept.
Is my error coding or conceptual? Here is some code using the grunfeld data.
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
The problem here is not with your coding but is conceptual: you cannot identify each coefficient in each regression in each bootstrap sample, because not all "countries" are included in the dataset for every bootstrap repetition. You can diagnose what is going on with the vce( , noisily) sub-option:
. regress mvalue kstock i.bscountry, vce(bootstrap, cluster(country) noisily)
Errors are generated because some coefficients are missing when the regression runs on particular bootstrap samples. In each regression you can see that some country dummies are being omitted due to collinearity. This should be expected and makes a lot of sense: a country dummy can equal 0 for all observations in the bootstrap sample if that country was not drawn!
If you are really trying to estimate the coefficients on the country dummies, you are going to have to find another approach than bootstrapping with K clusters, where K is the number of countries. If you don't care about the coefficients on the dummies, you could use another command that simply absorbs the fixed effects and only reports the coefficients on the other independent variables (e.g., areg or xtreg). One way to think about what is going on is that it is analogous to this:
. bootstrap, cluster(country) idcluster(bscountry) noisily: regress mvalue kstock i.bscountry
With the idcluster() option, each country that is drawn in a bootstrap sample is given its own ID number. If a country is drawn twice then there are two dummies. (The coefficients for the two dummies naturally turn out to be identical or near-identical.) However, the coefficients in this output are completely meaningless, because bscountry "2" will be a different country in different bootstrap iterations. Since you would ignore any output on the dummies, you might as well use a model like areg or xtreg, since they run more quickly.
Although there are many applications where bootstrapping with clusters works fine, the problem here is the inclusion of cluster dummies in the regression. This all raises the question of whether the exercise makes sense at all. If you are trying to estimate the coefficients on the country dummies, then it certainly does not. Otherwise, the solutions above might be OK, but it is hard to say without knowing your research question.

What could be causing errors when estimating coefficients with xtgls in Stata for unbalanced panel data over 4 years?

I am using unbalanced panel data covering 4 years. In trying to decide which model (xtgls, xtreg with re, or xtgee) is most appropriate for my analysis, I am trying to estimate coefficients for xtgls under both the homoskedasticity and heteroskedasticity assumptions. When I run this model with the hetero option, I get very high z-scores (>30) and a significant effect on a term that is insignificant in all other models.
Also, when I attempt to run lrtest comparing the hetero and homoskedastic models, I get an error that reads "hetero does not contain scalar e(ll)". I read that one way to address this is to add the option igls, which supposedly gives the same coefficients as the model without the igls option. However, my model will not converge with the igls option. I thought these odd results for the hetero xtgls model could be because some time-invariant variable was miscoded (e.g. a person coded as female = 1 for one year and female = 0 for another year). I checked my two IVs and this is not the case. I can't figure out what else could be causing this.
So my specific questions are:
Why would I be getting this error - “hetero does not contain scalar e(ll)” - for the lrtest comparing the homo and hetero models? What does it mean?
Below is my stata code:
xtgls continuous_DV IV1 IV2 IV1xIV2, i(person_id) panels(hetero)
estimates store hetero
xtgls continuous_DV IV1 IV2 IV1xIV2, i(person_id)
local df=e(N_g)-1
disp `df'
lrtest hetero ., df(`df')
I ran xttest3, which indicated that the errors are heteroskedastic.
Is igls an appropriate workaround for the error I am getting following the lrtest ("hetero does not contain scalar e(ll)")? If so, what could be causing the model with the igls option not to converge? Below is the code:
xtgls continuous_DV IV1 IV2 IV1xIV2, i(person_id) panels(hetero) igls
In Stata, the xtgls command does not estimate a log likelihood because it is not maximum likelihood estimation, so you cannot get a likelihood-ratio test out of that model. To get a log likelihood, you need to use the setup you had above but add the igls option. That workaround is entirely appropriate; I don't think you need to start by slashing your dataset.
Alternatively, you can use a different estimator. GLS is appropriate when you have few, wide panels. If you have really short panels (only a couple years per individual), you should probably use something like xtreg.
http://www.stata.com/support/faqs/statistics/xtgls-versus-regress/

Equivalent R^2 for Logit Regression in Stata

I am running Logit Regression in Stata.
How can I know the explanatory power of the regression (in OLS, I look at R^2)?
Is there a meaningful approach to expanding the regression with other independent variables? (In OLS, I manually keep adding independent variables and look at the adjusted R^2; my guess is that Stata has a way to simplify this manual process.)
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden Pseudo R2 in the Stata output altogether.
Hosmer and Lemeshow recommend, 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation', using the likelihood-ratio test (G): G = D(model without the variables [B]) - D(model with the variables [A]).
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR test gives p > .05, do not reject H0, which implies that, statistically speaking, there is no advantage to including the additional IVs in the model.
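Spelled out, with D denoting the deviance (-2 times the log likelihood):
G = D_B - D_A = -2[ln(L_B) - ln(L_A)]
Under H0, G follows a chi-squared distribution with q degrees of freedom, where q is the number of eliminated variables (in the example below, q = 1 because only IV2 is dropped).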
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // LR test of the restricted model B against the full model A (B is nested in A)
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
Applied Logistic Regression, David W. Hosmer and Stanley Lemeshow, ISBN-13: 978-0471356325
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the proportion of variance that your linear model explains, which may or may not be an appropriate benchmark for your model.
Likewise, the presence or absence of an independent variable in your model requires substantive justification. If you want to see how the R-squared changes when adding or removing parts of your model, see help nestreg for nested regression.
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory to build your model onto.
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just like you should be after running OLS.
You might also want to read up on the likelihood-ratio chi-squared test, or run additional lrtest commands as explained by Eric.
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are ways to see how good a job your model does at prediction. For example, check out the following commands:
lroc
estat class
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic