I am using Stata 13 and a balanced panel dataset (t=Year and i=Individual, denoted by Year and IndvID respectively) with the following econometric model:
Y = b1*var1 + b2*var2 + b3*var1*var2 + b4*var3 + fe + epsilon
I am estimating the following fixed-effects regression with year dummies and individual-specific linear time trends:
xi: xtreg Y var1 var2 c.var1#c.var2 var3 i.Year i.IndvID|Year, fe vce(cluster IndvID)
(all variables are continuous, except for the dummies created by i.Year and i.IndvID|Year)
I want Stata to derive/report the overall marginal effect of var1 and var2 on the outcome Y:
dY/dvar1 = b1 + b3*var2
dY/dvar2 = b2 + b3*var1
Because I estimate the fixed-effects regression with cluster-robust standard errors, I want to make sure the marginal effects are computed taking into account the same heterogeneity that the clustered standard errors correct for. My understanding is that this can be achieved using the
vce(unconditional) option of the margins command. However, after running the above regression, when I run the command
margins, dydx(var1) vce(unconditional)
I get the following error:
xtreg is not supported by margins with the vce(unconditional) option
Am I missing something obvious here, or am I not going about this correctly? How can I cluster the standard errors for margins estimates computed by Stata, rather than using the default delta method, which doesn't correct for this?
Thanks in advance,
-Mark
The marginal effects of var1 and var2 are functions (of var2 and var1, respectively). If you want the marginal effect of var1 at the mean level of var2, for example, you can use the lincom command after the regression:
sum var2
local m1 = r(mean)
lincom var1 + `m1' * c.var1#c.var2
This gives the point estimate of the marginal effect at the mean, together with a standard error derived from the robust covariance matrix estimated by xtreg.
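The standard error lincom reports is just the delta-method variance of the linear combination b1 + m*b3, built from entries of the coefficient covariance matrix. A minimal numeric sketch (all coefficient and covariance values below are made up for illustration, not taken from the regression above):

```python
import math

# Hypothetical estimates and the relevant entries of the (cluster-robust)
# coefficient covariance matrix from the fitted regression.
b1 = 0.50           # coefficient on var1
b3 = -0.20          # coefficient on the var1*var2 interaction
var_b1 = 0.010      # Var(b1)
var_b3 = 0.004      # Var(b3)
cov_b1_b3 = -0.002  # Cov(b1, b3)
m2 = 1.5            # sample mean of var2

# Marginal effect of var1 evaluated at the mean of var2: dY/dvar1 = b1 + b3*m2
me = b1 + b3 * m2

# Delta-method variance of the linear combination:
# Var(b1 + m2*b3) = Var(b1) + m2^2 * Var(b3) + 2*m2*Cov(b1, b3)
se = math.sqrt(var_b1 + m2**2 * var_b3 + 2 * m2 * cov_b1_b3)

print(me, se)
```

Because the covariance entries come from the cluster-robust matrix that xtreg stored, the resulting standard error inherits the clustering; no separate correction is needed.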
I'm running a multivariate linear regression model in SAS (v. 9.3) using the REG procedure with the stepwise statement, as follows below:
(1) Set the regressors list:
%let regressors = x1 x2 x3;
(2) Run the procedure:
ods output DWStatistic=DW ANOVA=F_Fisher parameterestimates=beta CollinDiag=Collinearita outputstatistics=residui fitstatistics=rsquare;
proc reg data=base_dati outest=reg_multivar edf;
model TD&eq. = &regressors. / selection=stepwise SLSTAY=&signif_amm_multivar_stay. SLENTRY=&signif_amm_multivar_entry. VIF COLLIN adjrsq DW R influence noint;
output out=diagnostic;
quit;
ods output close;
By adding one regressor to the list, let's say x4, to the macro variable &regressors., the beta value estimates change, although the selected variables are the same ones.
In practice, in both cases the variables chosen by the selection method are x1 and x2, but the beta parameters for x1 and x2 in the second case change with respect to the first.
Could you provide an explanation for that?
It would be nice to have a reference for such explanation.
Thanks all in advance!
I'm going to guess that you have missing data. SAS removes records listwise, so if you include additional variables that happen to have a few missing values, those entire records are dropped, which means you're not actually using the exact same data in each regression model.
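The effect is easy to see with a toy complete-case filter (the data below are invented for illustration; SAS's listwise deletion behaves the same way):

```python
# Toy rows with scattered missing values (None plays the role of SAS missing).
rows = [
    {"y": 1.0, "x1": 0.2,  "x2": 1.1, "x3": 0.5, "x4": None},
    {"y": 2.0, "x1": 0.4,  "x2": 0.9, "x3": 0.7, "x4": 3.1},
    {"y": 3.0, "x1": None, "x2": 1.3, "x3": 0.2, "x4": 2.2},
    {"y": 4.0, "x1": 0.8,  "x2": 1.8, "x3": 0.9, "x4": None},
]

def complete_cases(rows, cols):
    """Keep only rows with no missing value in any of the requested columns."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

# Same data, two variable lists: the estimation samples differ.
n_small = len(complete_cases(rows, ["y", "x1", "x2", "x3"]))        # 3 rows
n_large = len(complete_cases(rows, ["y", "x1", "x2", "x3", "x4"]))  # 1 row

print(n_small, n_large)
```

Since the two models are fit on different estimation samples, the coefficients for the variables they share will generally differ even when the selection step picks the same variables.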
Can anyone help me understand the premodel and postmodel adjustments for oversampling using the offset method (preferably in Base SAS, with PROC LOGISTIC and scoring) in logistic regression?
I will take an example. Consider a traditional credit-scoring model for a bank: let's say we have 52,000 customers, with 50,000 good and 2,000 bad. Now for my logistic regression I am using all 2,000 bad customers and a random sample of 2,000 good customers. How can I adjust for this oversampling in PROC LOGISTIC using options like OFFSET, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes, the target variable is a rare event, like fraud. In this case, using logistic regression will have significant sample bias due to insufficient event data. Oversampling is a common method due to its simplicity.
However, model calibration is required when the scores are used for decisions (this is your case); nothing needs to be done if the model is only used for rank ordering (bear in mind the probabilities will be inflated, but the order stays the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is: ln(y/(1-y)) = b0 + b1*x. Under oversampling, the estimate b1' is consistent with the true model; however, b0' is not equal to b0.
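This intercept shift can be checked directly with Bayes' rule: keeping events with probability r1 and non-events with probability r0 multiplies the conditional odds by r1/r0, so the sample logit equals the population logit plus the constant log(r1/r0). A quick numeric sketch (hypothetical coefficients and sampling rates, chosen only for illustration):

```python
import math

# Hypothetical population model and case-control sampling rates.
b0, b1 = -6.0, 2.0   # population logit: b0 + b1*x
r1, r0 = 1.0, 0.05   # keep all events, a 1-in-20 sample of non-events

for x in (-1.0, 0.0, 1.5):
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # population P(y=1|x)
    # Bayes' rule: odds in the retained sample = (r1/r0) * population odds
    sample_odds = (r1 * p) / (r0 * (1.0 - p))
    shift = math.log(sample_odds) - (b0 + b1 * x)
    print(round(shift, 6))  # log(r1/r0) = log(20), identical for every x
```

Because the shift does not depend on x, only the intercept is biased; the slope is untouched, which is exactly why a constant offset can fix the problem.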
There are generally two ways to do that:
weighted logistic regression,
simply adding offset.
I am going to explain the offset version only as per your question.
Let’s create some dummy data where the true relationship between your DV (y) and your IV (iv) is ln(y/(1-y)) = -6 + 2*iv:
data dummy_data;
do j=1 to 1000;
iv=rannor(10000); *independent variable;
p=1/(1+exp(-(-6+2*iv))); * event probability;
y=ranbin(10000,1,p); * dependent variable (1/0);
drop j;
output;
end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 979 97.90 979 97.90
1 21 2.10 1000 100.00
Similar to your problem, the event rate is p = 0.0210, in other words very rare.
Let’s use proc logistic to estimate the parameters:
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.4337 0.4874 124.3027 <.0001
iv 1 1.8356 0.2776 43.7116 <.0001
The logistic estimates are quite close to the real model; however, as you already know, with so few events the basic assumptions will not hold.
Now let’s oversample the original dataset by selecting all event cases and a 1-in-20 random sample of the non-event cases:
data oversampling;
set dummy_data;
if y=1 then output;
if y=0 then do;
if ranuni(10000)<1/20 then output;
end;
run;
proc freq data=oversampling;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 54 72.00 54 72.00
1 21 28.00 75 100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.9836 0.6982 18.2622 <.0001
iv 1 2.0068 0.5139 15.2519 <.0001
As you can see, the iv estimate is still close to the real value, but your intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio of the sample event odds to the known population event odds; it adjusts the intercept to reflect the true distribution of events rather than the sample distribution (the oversampled dataset).
Offset = log((0.28/(1-0.28)) * ((1-0.0210)/0.0210)) = 2.897548
So your adjusted intercept will be intercept = -2.9836 - 2.897548 = -5.881148, which is quite close to the real value.
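As a sanity check on the arithmetic (the rates are taken from the example above):

```python
import math

# Event rates from the example: 0.28 in the oversampled data, 0.0210 in the
# original population.
p_sample = 0.28
p_true = 0.0210

# Offset = log of the ratio of sample odds to population odds.
offset = math.log((p_sample / (1 - p_sample)) * ((1 - p_true) / p_true))
print(round(offset, 4))  # 2.8975

# The adjusted intercept is the oversampled estimate minus the offset.
b0_oversampled = -2.9836
b0_adjusted = b0_oversampled - offset
print(round(b0_adjusted, 4))  # -5.8811, close to the true value of -6
```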
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.8811 0.6982 70.9582 <.0001
iv 1 2.0068 0.5138 15.2518 <.0001
off 1 1.0000 0 . .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
Hope it helps.
This is a great explanation.
When you oversample or undersample in a rare-event setting, the intercept is impacted, not the slope. Hence, in the final output you just need to adjust the intercept by adding the offset statement in proc logistic in SAS. Probabilities are impacted by oversampling, but again, ranking is not impacted, as explained above.
If your aim is to score your data into deciles, you do not need the offset adjustment: you can rank the observations by their probabilities from the oversampled model and put them into deciles (using proc rank as normal). However, the actual probability scores are impacted, so you cannot use the raw probability values. The ROC curve is not impacted either.
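The reason ranking survives is that the logistic function is monotone in the linear predictor, so shifting the intercept rescales every probability without reordering any of them. A small sketch with hypothetical scores (the intercepts mirror the example above; the x values are invented):

```python
import math

def logistic(b0, b1, x):
    """P(y=1|x) under a simple logit model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

xs = [-1.2, 0.3, 0.7, 2.5, -0.4]

p_oversampled = [logistic(-2.98, 2.0, x) for x in xs]  # inflated scores
p_adjusted = [logistic(-5.88, 2.0, x) for x in xs]     # offset-corrected

def rank(ps):
    """Indices sorted by score, i.e. the decile ordering."""
    return sorted(range(len(ps)), key=lambda i: ps[i])

print(rank(p_oversampled) == rank(p_adjusted))  # True: ordering unchanged
print(all(a > b for a, b in zip(p_oversampled, p_adjusted)))  # True: levels differ
```

The same monotonicity argument is why the ROC curve, which depends only on the ordering of scores, is unaffected.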
I was looking for a way to compute the effect size for Friedman's test using sas, but could not find any reference. I wanted to see if there is any difference between the groups and what its size was.
Here is my code:
proc freq data=mydata;
tables id*b*y / cmh2 scores=rank noprint;
run;
These are the results:
The FREQ Procedure
Summary Statistics for b by y
Controlling for id
Cochran-Mantel-Haenszel Statistics (Based on Rank Scores)
Statistic Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 230.7145 <.0001
2 Row Mean Scores Differ 1 230.7145 <.0001
This question is related to one posted on Cross Validated, which is concerned with the general statistical formula for computing the effect size for Friedman's test. Here, I would like to find out how to get the effect size in SAS.
I want to run stepwise on a linear probability model with time and individual fixed effects in a panel dataset, but stepwise does not support panels out of the box. The workaround is to run xtdata y x, fe followed by reg y x, r. However, the resulting standard errors are too small. One attempt to solve this problem can be found here: http://www.stata.com/statalist/archive/2006-07/msg00629.html but my panel is highly unbalanced (I have a different number of observations for different variables). I also don't understand how stepwise would incorporate this correction in its iterations with different variable lists. Since stepwise bases its decision rules on the p-values, this is quite crucial.
Reproducible Example:
webuse nlswork, clear /*unbalancing a bit:*/
replace tenure =. if year <73
replace hours =. if hours==24 | hours==38
replace tenure =. if idcode==3 | idcode == 12 | idcode == 19
xtreg ln_wage tenure hours union i.year, fe vce(robust)
eststo xtregit /*this is what I want to reproduce without xtreg to use it in stepwise */
xi i.year /* year dummies to keep */
xtdata ln_wage tenure hours union _Iyear*, fe clear
reg ln_wage tenure hours union _Iyear*, vce(hc3) /*regression on transformed data */
eststo regit
esttab xtregit regit
As you can see, the estimates are fine but I need to adjust the standard errors. Also, I need to do that in such a way that stepwise understands in its iterations when the number of variables changes, for example. Any help on how to proceed?
I have a panel data of firms (Panel specification: ID,Time) and trying to run Logit Random Effect model on this data in SAS where my binary (0,1) dependent variable is Def. This is the code that I am using:
PROC GLIMMIX DATA=mydata METHOD=QUAD(QPOINTS=21) NOCLPRINT;
CLASS ID;
MODEL Def (DESC) = VAR1 VAR2 VAR3 VAR4/SOLUTION DIST=BINARY LINK=LOGIT;
RANDOM INTERCEPT / SUBJECT=ID;
RUN;
When I run the code, the only table of coefficient estimates in the output is called "Solutions for Fixed Effects". My question is: where are the coefficient estimates for the random effects? How can I get those estimates in the output?
Thanks for your help in advance.
Sei
Just add the SOLUTION option to the RANDOM statement as well:
RANDOM INTERCEPT / SUBJECT=ID SOLUTION;