I am running a regression in EViews with cross-section data for a sample of 314 companies to check which factors affect the abnormal return in M&A deals.
The regression is
CAR ROE_ACQUIRER ROE_TARGET TARGET_DISTRESS PAYMENT_METHOD
where CAR is the Cumulative Abnormal Return calculated through event study methodology.
TARGET_DISTRESS and PAYMENT_METHOD are dummy variables.
The problem is that I have both autocorrelation and heteroscedasticity, and some of my data are negative, so I cannot take logs. None of my variables is significant except target distress.
My question is: how can I deal with the autocorrelation and heteroscedasticity using methods other than taking logs or first differences?
Many thanks,
Budor
I am cross-posting here a request for support that I posted on Statalist yesterday. I have not yet received any hints, and I thought posting it here could reach a wider audience.
The link to the original post is the following:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1659627-choose-the-appropriate-way-to-deal-with-weights-in-svyset?view=thread
Dear Members,
I designed a questionnaire to gather respondents' willingness to get vaccinated against COVID-19 via a discrete choice experiment. I relied on a company specialized in political opinion polls and market research to administer the survey. The company computed a weight for each respondent based on 1) the geographical location where the respondent lives (five macro-areas of Italy), 2) whether the respondent has a bachelor's degree or not, and 3) which age group she/he belongs to (five classes are considered).
The sum of the weights is equal to the number of individuals in the database. Individuals in the age classes 30-39 and 40-49 are oversampled, at our request (related to a research hypothesis), so the proportion of these two classes in the sample is larger than their actual proportion in the Italian population. The weights are computed to account for this feature and to guarantee that the sample is representative of the characteristics of the Italian population.
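To make the weighting described above concrete, here is a minimal Python sketch of one common way such weights can be built: simple post-stratification, where each respondent's weight is the population share of their cell divided by the sample share of that cell. This is only an assumption for illustration; the company's actual procedure is not stated in the post (it could be raking or something else), and the cell names and population shares below are hypothetical. Weights built this way sum to the sample size, which matches the property mentioned above.

```python
from collections import Counter

def poststrat_weights(cells, population_share):
    """Weight per respondent = population share of the cell / sample share of the cell.

    `cells` holds one cell label per respondent; `population_share` maps each
    label to its share of the target population (shares sum to 1).
    """
    n = len(cells)
    counts = Counter(cells)
    return [population_share[c] / (counts[c] / n) for c in cells]

# Hypothetical example: the 30-39 class is oversampled (60% of the sample, 30% of the population).
cells = ["30-39"] * 6 + ["other"] * 4
shares = {"30-39": 0.3, "other": 0.7}
weights = poststrat_weights(cells, shares)
print(weights)       # oversampled respondents get weights below 1
print(sum(weights))  # equals the number of respondents, up to rounding
```

Weights of this kind correct for unequal selection probabilities, which is the situation sampling weights (pweights in Stata terms) are meant for.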
I will use the data to estimate a logit model, multinomial logit models and mixed logit models.
The issue I am facing is how to properly declare the type of weight. I have no experience in using Stata to deal with this issue.
I am using Stata 17 on a PC with Windows 10 Pro 64 bit.
Combining the information from the video, the svyset manual, and the results of the help for "weight", I tried to work out the most appropriate solution.
I also tried several times to add the code here, but I kept receiving an error message about how I had formatted it. My apologies.
My question concerns the use of the VIF test for multicollinearity diagnostics when the model suffers from heteroskedasticity.
I want to use the HAC correction to account for heteroskedasticity in my model. However, VIF gives me starkly different results depending on whether I run it after estimating the model with simple OLS without any error correction, or after estimating the regression with HAC applied. I use EViews.
This surprised me, as the VIF statistic is just 1/(1-R^2), where R^2 comes from a model in which a given variable x_i is regressed on the rest of the X variables. This implies that the result should not depend on the standard errors of the estimated parameters in the original regression of y on X, and thus should not depend on whether I use robust errors or not.
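The textbook definition just described can be sketched in a few lines of Python with NumPy. This is a minimal illustration of 1/(1-R^2) from auxiliary regressions, not a reproduction of what EViews computes internally; the simulated data are hypothetical. Note that neither y nor any coefficient standard error enters the calculation.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    of X on the remaining columns plus a constant."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Hypothetical regressors: x2 is built to be correlated with x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=200)
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))
```

Computed this way, the VIFs depend only on the regressors, so they are identical whatever covariance estimator is later used for the main regression.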
However, in EViews VIF is calculated differently, and estimates of the standard errors of the parameters are used (tutorial, pp. 198 of the pdf). While it is suggested that both approaches are equivalent, this is clearly not the case in my example.
In short, in which order should I proceed: first test for multicollinearity with the simple OLS model and then move on to the model with HAC, or the other way round, estimating the model with HAC and then running VIF?
Thanks for all your help!
I am generating alerts by reading a dataset of KPIs (key performance indicators). My algorithm looks at historical data and, based on that, detects sudden spikes in the data. However, it generates false alarms. For example, KPI1 is historically at 0.5 but reaches a value of 12, which looks like a spike.
In the same way, KPI2 also goes from 0.5 to 12. But I know that KPI1 going from 0.5 to 12 is not a big deal and I do not need to capture it, whereas KPI2 going from 0.5 to 12 is a big deal and I do need to capture it.
I want to train my program to understand what is a high value, a low value, or a normal value for each KPI.
Could you experts tell me which ML algorithm is best for this, and which Python packages I should explore?
This is a classification problem. You can use the classic logistic regression algorithm to classify any given sample as a high value, a low value, or a normal value.
Quoting from Wikipedia:
In statistics, multinomial logistic regression is a classification
method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes. That is, it is
a model that is used to predict the probabilities of the different
possible outcomes of a categorically distributed dependent variable,
given a set of independent variables (which may be real-valued,
binary-valued, categorical-valued, etc.)
To perform multi-class classification in Python, the scikit-learn (sklearn) library can be useful.
http://scikit-learn.org/stable/modules/multiclass.html
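As a rough sketch of the suggestion above, the following fits one multi-class logistic regression per KPI with scikit-learn, so that the same raw value can be labelled differently for different KPIs. Several things here are assumptions and not part of the question: that hand-labelled history exists for each KPI, the labelled values themselves (made up for illustration), the choice of one model per KPI, and the log transform of the values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical hand-labelled history per KPI: (value, label).
history = {
    "KPI1": [(0.01, "low"), (0.02, "low"), (0.5, "normal"), (3.0, "normal"),
             (12.0, "normal"), (15.0, "normal"), (60.0, "high"), (80.0, "high")],
    "KPI2": [(0.01, "low"), (0.02, "low"), (0.4, "normal"), (0.5, "normal"),
             (0.6, "normal"), (8.0, "high"), (12.0, "high"), (15.0, "high")],
}

def fit_per_kpi(history):
    """Fit a separate low/normal/high classifier for each KPI."""
    models = {}
    for kpi, rows in history.items():
        X = np.log([[value] for value, _ in rows])  # log scale (assumption: values > 0)
        y = [label for _, label in rows]
        models[kpi] = LogisticRegression(max_iter=1000).fit(X, y)
    return models

def classify(models, kpi, value):
    """Return the predicted label for one new observation of the given KPI."""
    return str(models[kpi].predict(np.log([[value]]))[0])

models = fit_per_kpi(history)
for kpi in ("KPI1", "KPI2"):
    print(kpi, classify(models, kpi, 12.0))
```

With so few labelled points the predictions are only illustrative; in practice the quality of the labels matters far more than the choice of classifier.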
I want to run a GEE for clustered data - I am trying to get incidence rate ratios (IRR) for antibiotic reactions between two drugs. I have searched for information on constructing GEE models (GENMOD in SAS, xtgee in Stata) but I can't find criteria on what type of variables can be included as covariates. My model is this:
proc genmod data = mydata;
class Pt fev1_cat;
model rate_pip = cumulative_dose_before fev1_cat Average_Dose_Admis mero_rate /
type3 dist=poisson link=log;
repeated subject=Pt;
run;
rate_pip is the rate of adverse events (AE) for antibiotic in question, mero_rate is the rate of AE for a different antibiotic. The other variables are either categorical or continuous.
If I adjust the GEE with a covariate that is a rate, is it 1) a correct use of the GEE model, and 2) would the interpretation of the exp(coef) be the IRR between the two rates of AE, or is it interpreted as: for each unit increase in rate of mero_rate, the IRR of rate_pip is x times higher/lower?
I can't say whether this is a correct use of a GEE model without knowing a little more about the data structure, but I don't know of anything special about GEE models that would preclude the use of rate variables as predictors (as compared to, say, an ordinary least squares regression model). If the model is okay without the mero_rate predictor, it would probably be okay with it too. The one caveat may be that it can't be too correlated with the other predictors.
As far as interpretation goes, I think you've pretty much got it. The log of the incidence rate increases by beta units for a mero_rate value of x+1 events per unit time, compared to a mero_rate value of x events per unit time, all other things equal. Equivalently, the incidence rate of rate_pip is multiplied by exp(beta) for each one-unit increase in mero_rate.
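The interpretation above can be checked with a small Python calculation. The coefficient value is hypothetical (it is not taken from any output in the question); the point is only that, under a log link, a change of delta units in a predictor multiplies the expected rate by exp(beta * delta).

```python
import math

def rate_ratio(beta, delta=1.0):
    """Multiplicative change in the expected rate for a `delta`-unit
    increase in a predictor, under a log link."""
    return math.exp(beta * delta)

beta_mero = 0.25  # hypothetical coefficient on mero_rate
print(rate_ratio(beta_mero))       # ratio for a one-unit increase in mero_rate
print(rate_ratio(beta_mero, 2.0))  # a two-unit increase squares the one-unit ratio
```

So exp(coef) here is a rate ratio per unit of mero_rate, not a direct ratio between the two drugs' adverse event rates.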
I am running Logit Regression in Stata.
How can I know the explanatory power of the regression (in OLS, I look at R^2)?
Is there a meaningful approach to expanding the regression with other independent variables (in OLS, I manually keep adding independent variables and look at the adjusted R^2; my guess is that Stata has simplified this manual process)?
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden Pseudo R2 in the Stata output altogether.
Hosmer and Lemeshow recommend: 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation', using the likelihood ratio test (G), where D is the deviance (-2 times the log-likelihood): G = D(model without the variables [B]) - D(model with the variables [A]).
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR test gives p > .05, do not reject H0, which implies that, statistically speaking, there is no advantage in including the additional IVs in the model.
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // i.e. tests restricted model B (nested in A) against full model A
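For readers who want to see the arithmetic behind the test, here is a minimal Python sketch for the case in the example above, where exactly one variable (IV2) is dropped, so the test has 1 degree of freedom. The log-likelihood values are hypothetical, and the closed-form p-value used here is valid only for 1 degree of freedom.

```python
import math

def lr_test_df1(ll_restricted, ll_full):
    """Likelihood ratio test with 1 degree of freedom.

    G = D(restricted) - D(full) = 2 * (ll_full - ll_restricted),
    where D = -2 * log-likelihood. For a chi-squared variable with
    1 df, P(X > G) = erfc(sqrt(G / 2)).
    """
    g = max(2.0 * (ll_full - ll_restricted), 0.0)
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p

# Hypothetical log-likelihoods for model A (DV on IV1 and IV2) and model B (DV on IV1).
ll_A, ll_B = -100.0, -103.2
g, p = lr_test_df1(ll_B, ll_A)
print(g, p)
```

When more than one variable is dropped, the degrees of freedom equal the number of dropped coefficients and a general chi-squared distribution function is needed instead.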
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
Applied logistic regression, David W. Hosmer and Stanley Lemeshow , ISBN-13: 978-0471356325
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the amount of variance that your linear model predicts, which might be an appropriate benchmark to your model, or not.
Likewise, the presence or absence of an independent variable in your model requires substantive justification. If you want to have a look at how the R-squared changes when adding or subtracting parts of your model, see help nestreg for help on nested regression.
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory on which to build your model.
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just as you should after running OLS.
You might also want to read up on the likelihood-ratio chi-squared test or run additional lrtest commands as explained by Eric.
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are ways to see how good of a job your model does at predicting. For example, check out the following commands:
lroc
estat class
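To show the kind of quantities these two commands report, here is a minimal Python sketch that computes the area under the ROC curve and a simple classification summary from observed outcomes and predicted probabilities. It is an illustration of the concepts only, not a reproduction of Stata's output; the data and the function names are hypothetical, and the 0.5 cutoff is an assumption.

```python
def auc(y_true, p_hat):
    """Area under the ROC curve: the share of (positive, negative) pairs in
    which the positive case has the higher predicted probability (ties count half)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

def classification_summary(y_true, p_hat, cutoff=0.5):
    """Sensitivity, specificity and share correctly classified at a given cutoff."""
    pred = [1 if p >= cutoff else 0 for p in p_hat]
    tp = sum(1 for y, d in zip(y_true, pred) if y == 1 and d == 1)
    tn = sum(1 for y, d in zip(y_true, pred) if y == 0 and d == 0)
    fp = sum(1 for y, d in zip(y_true, pred) if y == 0 and d == 1)
    fn = sum(1 for y, d in zip(y_true, pred) if y == 1 and d == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "correctly_classified": (tp + tn) / len(y_true),
    }

# Hypothetical outcomes and predicted probabilities from a fitted logit model.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(y, p))
print(classification_summary(y, p))
```

An area of 0.5 means the model discriminates no better than chance, and 1.0 means it ranks every positive case above every negative one.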
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic