I have a panel dataset for multiple firms over 8 years and I'm trying to run a pooled OLS with industry-specific effects, using the reghdfe command to control for a categorical variable (NAICS Industry Code). I typed
reghdfe DV IV control variables i.year, absorb(NAICS Industry Code)
Is this the correct way to use the command? Is it correct to use i.year within the variables or should I add it to the absorbed variables?
In addition, I'm using a fixed-effects panel regression with clustered standard errors. Do I have to use clustered standard errors in reghdfe as well, or is it sufficient to do it only within the fixed-effects panel regression?
You should include your variable year in the absorb() option to match the intended use of reghdfe:
reghdfe y x, absorb(naics year)
Alternatively, you can also use reg y x i.naics i.year.
I assume NAICS codes to be numeric; otherwise, you might need to transform the variable to numeric, e.g. using egen num_naics= group(naics).
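A minimal sketch of the two specifications side by side, assuming reghdfe has been installed from SSC. The names y, x, naics, year and firm_id are placeholders, not variables taken from the question, and clustering by firm is only an assumption about what is wanted:

```stata
* Assumes the user-written command is installed: ssc install reghdfe
* y, x, naics, year and firm_id are placeholder variable names

* Map the industry code (string or numeric) to consecutive integer groups
egen num_naics = group(naics)

* Absorb industry and year effects instead of estimating the dummies
reghdfe y x, absorb(num_naics year) vce(cluster firm_id)

* Same model with explicit industry and year dummies
regress y x i.num_naics i.year, vce(cluster firm_id)
```

Standard-error options apply only to the command they are typed on, so clustering requested in a separate xtreg call does not carry over to reghdfe; it has to be specified again if it is wanted there.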
Note: The R-squared rests on different assumptions and might differ between the two commands.
Note_2: If your question is specifically about coding, everyone is better off when you provide example data. Statistical questions might be better suited for Cross Validated.
I am using PROC GLIMMIX to analyze repeated measures data about specific sexual events. The original data came from a weekly diary study of about 400 people. During each week they reported on behaviours from their most recent sexual encounter. We also have baseline data on their demographics. 12 weeks of observation were collected and we had a high completion rate.
I would like to create a mixed effect model, but I am unsure exactly how this is done in SAS. I want to test the effect of event-specific factors as well as some person-level demographics and would like to get odds ratios for each factor of interest. The outcome is whether or not drugs were used during the event, and the explanatory factors will be things like age, gender, etc. as well as characteristics of the event (e.g., partner HIV status, whether the partner was a regular sexual partner, etc.).
The code I'm working with follows this pattern:
PROC GLIMMIX DATA=work.dataset oddsratio;
CLASS VISIT_NUMBER PARTICIPANT_ID BINARY_EVENTLEVEL_OUTCOME BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR;
MODEL BINARY_EVENTLEVEL_OUTCOME = BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR / DIST=binary link=logit CL S ddfm=kr;
RANDOM ?????;
RUN;
option 1 for ?????: residual / subject=PARTICIPANT_ID
option 2 for ?????: INTERCEPT / subject=PARTICIPANT_ID
option 3 for ?????: VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1)
INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID)
option 4 for ?????: Other?
I am also unclear whether I should use ddfm=kr in my model statement or method=laplace in my proc statement -- both have been recommended elsewhere for this sort of repeated measures analysis.
I've come across several potential options for modelling this which often give similar results, but option 1 gives a statistically significant result for an event-level factor, while the others give non-significant results. Including ddfm=kr makes the result of interest more significant. method=laplace does not allow for option 1.
I may not be answering your question, but might be able to provide a couple of directions:
To start with the simplest part, your MODEL statement looks correct to me as you want to test event-level factors and person-level demographics which are thus considered as fixed effects.
Now, as far as the random effects are concerned:
the RANDOM statements you propose for options (1) and (2):
(1) RANDOM _residual_ / subject=PARTICIPANT_ID;
or
(2) RANDOM intercept / subject=PARTICIPANT_ID;
are modeling two different parts of the random effects: the R-side and the G-side, respectively.
If you are already familiar with PROC MIXED, note that your option (1), using RANDOM _residual_ in PROC GLIMMIX, is equivalent to using the REPEATED statement in PROC MIXED, which indicates that you have repeated measures for PARTICIPANT_ID. That is clearly your case (Ref: "Comparing the GLIMMIX and MIXED Procedures").
As for option (3):
RANDOM VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1) INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID);
here you are modeling the time component of the repeated measures (visit_num) as a random effect. This should be included when you believe that the response varies randomly at each of the measurement times (i.e. at each event). At first glance, I don't believe this is relevant in your case, since you are already taking this into account through the fixed effects... but of course I may be wrong, as I have not seen your data.
Up to here is what I can contribute at this time.
As next steps for you to have a better understanding, I would suggest that you:
Read the Overview of the PROC GLIMMIX documentation, in particular the mathematical model specification and all 3 sections therein:
The Basic Model
G-Side and R-Side Random Effects and Covariance Structures
Relationship with Generalized Linear Models
If you are still unsure, ask your question at communities.sas.com, where people might be able to help you better.
HTH
In Stata, is there a way to redirect the data that a command produces into a table instead of a graph?
Example: if someone created a normal probability distribution of data with the pnorm var_name command, is there a way to redirect the data so that instead of appearing in a graph, it appears in a table?
To add to @Noobie's answer:
Different commands work in different ways. There's no better short summary.
What you can look out for includes
generate() options that produce new variables. (There is no absolute rule that the options have this name, but that or a similar name is the most common single variety.)
Options that allow saving results to new datasets.
Saved results, especially those visible after return list or ereturn list. These can be quite elaborate, e.g. saving of matrices of counts after tabulate.
More broadly, Stata commands aren't functions! One characteristic of a function, as so named in many languages or programs, is that there is a result, with special cases where the result is void or null. There clearly are statistical programs which in broad terms hinge on calling functions which have results, and what you see displayed is often a side-effect of that. Stata commands don't work like that in the sense that the results of a program can be various. In the case of commands designed just to show something, the "result" may be a display. It's worth noting that Mata, which underlies and underpins Stata, is more recognisably a C-like language, with (e.g.) many matrix extensions, which is based on functions (and much else).
Yes and no. It really depends on the command you are using. You should look at the help files first.
For instance, pnorm does not allow that. You can create the data yourself using the formula for pnorm described in the help file, where the cumulative distribution at some point is plotted against the so-called plotting position.
Other Stata commands allow you to generate the points directly. This is the case for kdensity for instance.
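As a rough sketch of both routes, the following rebuilds the two quantities that pnorm plots and then lets kdensity store its points in variables. It assumes the auto dataset that ships with Stata, with price standing in for var_name and no missing values in that variable; the new variable names are made up for illustration:

```stata
* Example data shipped with Stata; price stands in for var_name
sysuse auto, clear

* Standardize using the sample mean and standard deviation
quietly summarize price
generate double z = (price - r(mean)) / r(sd)

* Plotting position i/(N+1) for the i-th smallest value
* (assumes no missing values in price)
sort price
generate double plot_pos = _n / (_N + 1)

* Normal CDF evaluated at the standardized value
generate double normal_p = normal(z)

* The numbers behind the pnorm graph, first 10 rows
list price plot_pos normal_p in 1/10

* kdensity can store its grid points and density estimates directly
kdensity price, generate(kd_x kd_density) nograph
list kd_x kd_density in 1/10
```

The first part is a hand computation rather than an option of pnorm itself, so it is worth checking against the formulas in the help file before relying on it.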
(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multivariate regression (outcome variable is continuous, happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs at various values of pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dy/dx option, and so forth. I remain unsuccessful. But this seems like something that there must be a way to do; wanting to know if a treatment effect varies for values of a control (say, income) must be common.
Can anyone direct me to the right command, or options for Margins, to output values of Beta1 (coefficient on treatment dummy), rather than of Y (GPA), at values of the pre_score?
The question was resolved at Statalist. It turns out that margins alone can't do what I was trying to do; the model needs to include an interaction between treatment and pre_score. Then it's simple.
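A minimal sketch of that approach, assuming treatment is coded 0/1 and that X3 X4 X5 stand for the remaining covariates of the original model:

```stata
* Interact the treatment dummy with the continuous pre-score
regress GPA i.treatment##c.pre_score X3 X4 X5

* Treatment effect (discrete change from 0 to 1) at each pre_score value
margins, dydx(treatment) at(pre_score = (1(0.25)5))

* Plot the treatment effect against pre_score
marginsplot
```

Without the interaction term the treatment effect is a single coefficient, so margins would report the same value at every pre_score.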
I'm trying to save output from several hundred eststo's storing results of bivariate probability models into one Excel file using esttab. It works for xtlogit (both ,re and ,pa), xtprobit (both ,re and ,pa) and for the linear probability model xtreg (both standard and ,fe). However, when I use xtreg y x i.year, fe I get the error message too many base levels specified. Google doesn't help me much.
I've been trying for an hour to create a reproducible example, but the Stata datasets all work fine. It does not seem to be due to the number of years or the fact that different specifications have data for different years. Still, the normal xtreg, fe works; the problem only appears with time dummies. The weirdest thing is that it works for all subsets of my variables but not for the whole list (again, just the time fixed effects specifications).
Does anyone have an idea how to proceed? Using drop(*.year) works whenever the problem does not arise (so in specifications where it works, I get outputs without the year dummies) but does not prevent the too many base levels specified error; ,nobaselevels has no apparent effect either. Is there a way to remove the time fixed effects from eststo before I pass those on to esttab? Any workaround would be appreciated as well.
The problem you might be facing is that Stata picks a different base level for the factor variable year in different regressions, so the stored results disagree on which year is the base.
Try fixing the factor variable base level beforehand with fvset:
fvset base <some_number> year
Check help fvset and the manual entry for details. Also, read the source given below, which contains more information.
Source: two posts from Statalist; one from Tim Wade and another by Jeff Pitblado.
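A short sketch of how this might look, assuming eststo from the estout package is installed. The year 2005 and the names y, x and m1 are hypothetical; the base year should be one that occurs in every estimation sample:

```stata
* 2005 is a hypothetical year; choose one present in every estimation sample
fvset base 2005 year

* y and x are placeholders; every model now uses the same base year
xtreg y x i.year, fe
eststo m1

* Alternative: set the base level inside the command itself
* xtreg y x ib2005.year, fe
```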
I have a dataset of membership information, and I want to keep only the people who have been continuously enrolled for the entire year. There are 12 variables for each person, one for each month of the year with how many days during that month they were enrolled. Is there a way to make a subset of the data for just those with a value >1 for each of the month variables?
Thanks!
SAS has various summary functions that might well be what you're looking for. See min() (minimum) in particular, as it will allow you to find the minimum of several variables. You may also want to consider nmiss() (number of missing values) and n() (number of non-missing values) if you have to deal with missing values in your data.
Summary functions can be passed lists of variables like this (in a data step):
minimum = min(var1, var2, var3);
However, that can become long-winded if you need to use a lot of variables. Fortunately, SAS provides several ways to reference lists of variables to make things neater. You can read about these variable lists here. To use them in a summary function, use the of qualifier:
minimum = min(of var1-var12);
maximum = max(of var:);
blanks = nmiss(of _NUMERIC_);
Finally, you will want to use the newly computed value to decide which observations to include. To do this in a data step, look at the output statement (user guide):
if min(of var:) > 1 then output;
Or if you feel like learning a bit more about SAS's syntax you could try using an implicit output by reading through the last link.
In general, it's preferred to ask specific questions and show your current work on SO, and I'd advise using Google to answer your basic questions while you're learning the fundamentals. There is plenty of great documentation available to help you out there.