Running a GEE model for a rate, and adjusting with a covariate that is also a rate (GENMOD SAS) - sas

I want to run a GEE for clustered data: I am trying to get incidence rate ratios (IRRs) for antibiotic reactions between two drugs. I have searched for information on constructing GEE models (GENMOD in SAS, xtgee in Stata) but I can't find criteria on what types of variables can be included as covariates. My model is this:
proc genmod data=mydata;
   class Pt fev1_cat;
   model rate_pip = cumulative_dose_before fev1_cat Average_Dose_Admis mero_rate
         / dist=poisson link=log type3;
   repeated subject=Pt;
run;
rate_pip is the rate of adverse events (AE) for the antibiotic in question; mero_rate is the rate of AE for a different antibiotic. The other variables are either categorical or continuous.
If I adjust the GEE with a covariate that is itself a rate, 1) is that a correct use of the GEE model, and 2) would exp(coef) be interpreted as the IRR between the two rates of AE, or as: for each one-unit increase in mero_rate, the rate of rate_pip is x times higher/lower?

I can't say whether this is a correct use of a GEE model without knowing a little more about the data structure, but I don't know of anything special about GEE models that would preclude the use of rate variables as predictors (as compared to, say, an ordinary least squares regression model). If the model is okay without the mero_rate predictor, it would probably be okay with it too. The one caveat is that it shouldn't be too highly correlated with the other predictors.
As far as interpretation goes, I think you've pretty much got it. The log of the incidence rate increases by beta units for a mero_rate value of x+1 events per unit time, compared to a mero_rate value of x events per unit time, all other things being equal; equivalently, exp(beta) is the multiplicative change in the rate of rate_pip per one-unit increase in mero_rate.
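If it helps, PROC GENMOD can report the exponentiated coefficient and its confidence limits directly; a minimal sketch based on your model (the ESTIMATE label is arbitrary):
proc genmod data=mydata;
   class Pt fev1_cat;
   model rate_pip = cumulative_dose_before fev1_cat Average_Dose_Admis mero_rate
         / dist=poisson link=log type3;
   repeated subject=Pt;
   /* EXP reports exp(beta), i.e. the rate ratio per one-unit increase in mero_rate */
   estimate 'mero_rate +1 unit' mero_rate 1 / exp;
run;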

Related

Analyze Repeated Measures Data Using PROC GLIMMIX

I am using PROC GLIMMIX to analyze repeated measures data about specific sexual events. The original data came from a weekly diary study of about 400 people. During each week they reported on behaviours from their most recent sexual encounter. We also have baseline data on their demographics. Twelve weeks of observation were collected, and we had a high completion rate.
I would like to fit a mixed-effects model, but I am unsure exactly how this is done in SAS. I want to test the effect of event-specific factors as well as some person-level demographics, and I would like to get odds ratios for each factor of interest. The outcome is whether or not drugs were used during the event, and the explanatory factors will be things like age, gender, etc., as well as characteristics of the event (e.g., partner HIV status, whether the partner was a regular sexual partner, etc.).
The code I'm working with follows this pattern:
PROC GLIMMIX DATA=work.dataset oddsratio;
   CLASS VISIT_NUMBER PARTICIPANT_ID BINARY_EVENTLEVEL_OUTCOME
         BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR;
   MODEL BINARY_EVENTLEVEL_OUTCOME = BINARY_EVENTLEVEL_EXPLANATORY_FACTOR
         CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR / DIST=binary LINK=logit CL S DDFM=kr;
   RANDOM ?????;
RUN;
option 1 for ?????: residual / subject=PARTICIPANT_ID
option 2 for ?????: INTERCEPT / subject=PARTICIPANT_ID
option 3 for ?????: VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1)
INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID)
option 4 for ?????: Other?
I am also unclear whether I should use ddfm=kr in my model statement or method=laplace in my proc statement -- both have been recommended elsewhere for this sort of repeated measures analysis.
I've come across several potential options for modelling this which often give similar results, but option 1 gives a statistically significant result for an event-level factor, while the others give non-significant results. The inclusion of ddfm=kr makes the result of interest more significant. method=laplace does not allow for option 1.
I may not be answering your question, but might be able to provide a couple of directions:
To start with the simplest part, your MODEL statement looks correct to me: you want to test event-level factors and person-level demographics, which are thus treated as fixed effects.
Now, as far as the random effects are concerned:
the RANDOM statements you propose for options (1) and (2):
(1) RANDOM _residual_ / subject=PARTICIPANT_ID;
or
(2) RANDOM intercept / subject=PARTICIPANT_ID;
are modeling two different parts of the random effects: the R-side and the G-side, respectively.
If you are already familiar with PROC MIXED, note that your option (1) of using RANDOM _residual_ in PROC GLIMMIX is equivalent to the REPEATED statement in PROC MIXED, which specifies that you have repeated measures within PARTICIPANT_ID, which is clearly your case (Ref: "Comparing the GLIMMIX and MIXED Procedures").
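For concreteness, a minimal sketch of the two complete specifications, using the placeholder names from the question; the (event='1') response option and the TYPE=cs working structure are assumptions on my part, and METHOD=laplace is shown only with the G-side version because, as you note, it is not available with R-side residual effects:
/* Option 1: R-side (marginal, GEE-like) residual covariance within participant */
PROC GLIMMIX DATA=work.dataset;
   CLASS PARTICIPANT_ID VISIT_NUMBER BINARY_EVENTLEVEL_EXPLANATORY_FACTOR;
   MODEL BINARY_EVENTLEVEL_OUTCOME(event='1') = BINARY_EVENTLEVEL_EXPLANATORY_FACTOR
         / DIST=binary LINK=logit S CL;
   RANDOM _residual_ / SUBJECT=PARTICIPANT_ID TYPE=cs;
RUN;

/* Option 2: G-side random intercept per participant (subject-specific model) */
PROC GLIMMIX DATA=work.dataset METHOD=laplace;
   CLASS PARTICIPANT_ID VISIT_NUMBER BINARY_EVENTLEVEL_EXPLANATORY_FACTOR;
   MODEL BINARY_EVENTLEVEL_OUTCOME(event='1') = BINARY_EVENTLEVEL_EXPLANATORY_FACTOR
         / DIST=binary LINK=logit S CL;
   RANDOM INTERCEPT / SUBJECT=PARTICIPANT_ID;
RUN;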
As for option (3):
RANDOM VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1) INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID);
here you are modeling the time component of the repeated measures (visit_num) as a random effect, and this should be included when you believe there is random variation of the response at each of the measurement times (i.e., at each event). At first glance, I don't believe this is relevant in your case, since you are already taking this into account through the fixed effects... but of course I may be wrong, not having seen your data.
Up to here is what I can contribute at this time.
As next steps for you to have a better understanding, I would suggest that you:
Read the Overview of the PROC GLIMMIX documentation, in particular the mathematical model specification and all 3 sections therein:
The Basic Model
G-Side and R-Side Random Effects and Covariance Structures
Relationship with Generalized Linear Models
If you are still unsure, ask your question at communities.sas.com, where people may be able to help you further.
HTH

MCMC taking forever to run in SAS

Suppose I have a regression where the response variable is sales and I have various drivers of sales as the independent variables. I want to build the model using MCMC, but I am unsure whether that is even feasible (I am running it in SAS). See below for a simplified model structure (there are many more variables and random interactions in the production model):
Y_ij = β0 + β1^TV * X1_ij + γ^(TV×dma)_i + ε_ij
For the model above, I have one main effect for TV, represented by β1, and a random interaction between DMA (there are 210 DMAs in the US) and TV, represented by gamma. I have priors for all of my parameters, and when I run PROC MCMC in SAS it takes hours to run. Can MCMC handle 210 random interactions in the random term? I am using MCMC because I want to use the prior knowledge from previous modeling rounds, but that makes no sense if it takes forever to run.
proc mcmc data=modeldbsubset outpost=postout thin=1000 nmc=20000 seed=7893
          monitor=(b0 b1);
   ods select PostSummaries PostIntervals tadpanel;
   parms b0 0 b1 0;
   parms s2 1;
   parms s2g 1;
   prior b: ~ normal(0, var = 10000);
   prior s2: ~ igamma(0.001, scale = 1000);
   random gamma ~ normal(0, var=s2g) subject=dmanum monitor=(gamma) namesuffix=position;
   mu = b0 + b1*TV + gamma;
   model Y ~ normal(mu, var = s2);
run;
I don't use SAS, but it's no surprise this scale of model would fail miserably with the default random-walk Metropolis sampler, which initializes the proposal distribution with an identity covariance matrix. The documentation on scale tuning says you can tune to a MAP estimate of the covariance (this is what PyMC3 does by default), so maybe try starting there. However, the docs also say that doing this will then use the MAP for parameter initialization, which is a bad idea, since the MAP is usually not in the typical set in high dimensions.
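For what it's worth, the PROC MCMC option for that kind of tuning appears to be PROPCOV= on the PROC statement. A minimal sketch based on the code in the question (PROPCOV=QUANEW asks the procedure to build the proposal covariance from a quasi-Newton optimization rather than starting from the identity; whether that is enough for this model is an open question):
/* same model as in the question, but with an optimization-based proposal covariance */
proc mcmc data=modeldbsubset outpost=postout thin=1000 nmc=20000 seed=7893
          propcov=quanew monitor=(b0 b1);
   parms b0 0 b1 0;
   parms s2 1;
   parms s2g 1;
   prior b: ~ normal(0, var = 10000);
   prior s2: ~ igamma(0.001, scale = 1000);
   random gamma ~ normal(0, var=s2g) subject=dmanum;
   mu = b0 + b1*TV + gamma;
   model Y ~ normal(mu, var = s2);
run;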
In the end, I expect you’ll need to do a lot of tuning specific to your data to really get it running cleanly, and unfortunately that’s just part of the art.
Alternatively, you might be better off picking up a more advanced MCMC sampling framework that implements HMC/NUTS, such as Stan, PyMC3, or Edward. There are even some high-level packages, like RStanArm, built specifically for Bayesian regression modeling, that keep the lower-level MCMC machinery in the background.

Value of coefficient (Beta1) at different values of other covariate (X2), hopefully graphed

(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multiple regression (the outcome variable is continuous; it happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs at various values of the pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dydx() option, and so forth, but I remain unsuccessful. This seems like something there must be a way to do; wanting to know whether a treatment effect varies across values of a control (say, income) must be common.
Can anyone direct me to the right command, or options for margins, to output values of Beta1 (the coefficient on the treatment dummy), rather than of Y (GPA), at values of the pre_score?
The question was resolved at Statalist. It turns out margins alone can't do what I was trying to do; the model needs to be fit with an interaction term between treatment and the pre-score. Then it's simple.
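For anyone landing here later, a minimal sketch of that solution using factor-variable notation (the variable names follow the question; the exact covariate list is an assumption):
* refit with a treatment-by-pre_score interaction
regress GPA i.treatment##c.pre_score X3 X4 X5

* treatment effect (dy/dx of the treatment dummy) at various values of pre_score
margins, dydx(treatment) at(pre_score = (1(0.25)5))
marginsplot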

Need help to find outlier in longitudinal data using sas

I have a classroom of students with test scores taken weekly. I expect the test results to improve over time. I want to identify a poor performer as an outlier, based on not improving over time, using SAS (I have 9.2). Also, are there accepted criteria for being an outlier for part of the time interval but not the complete interval? This is the bulk of my present code (not looking for outliers yet, just the longitudinal analysis):
proc mixed data=XYZ_LONG;
   title1 'XYZ Analysis';
   class group day subject;
   model TV = group day group*day / ddfm=satterthwaite;
   repeated day / type=cs sub=subject;
run;
I don't think your definition of "poor performer" is really a definition of an outlier. However:
If you want to find people who did not improve over time, that's pretty easy, but you have to define it more precisely. Did they not improve between any two weeks? Between the first and last weeks? Something else?
And what do you mean by "not improve", exactly? Do you mean it literally (the same or a worse score at a later time)?
In either case, I'd use an array and find difference scores and then identify difference scores that were negative (or whatever you want).
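A minimal sketch of that approach, assuming the weekly scores have first been transposed to one row per student; the dataset name xyz_wide, the variables score1-score9, and the "never improved week over week" rule are all assumptions:
data flagged;
   set xyz_wide;                      /* hypothetical wide dataset: one row per student */
   array s{9} score1-score9;          /* weekly test scores */
   improved = 0;
   do week = 2 to 9;
      if s{week} > s{week-1} then improved = 1;
   end;
   poor_performer = (improved = 0);   /* 1 = never improved from one week to the next */
run;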
However, if you are going to be doing modelling, then an outlier should probably be defined in terms of that model; that is, in your model, accounting for group. But if you have a lot of outliers and they aren't bad data, you should not throw those people out; use a better model instead.
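If you go that model-based route, one hedged way to do it with your PROC MIXED code is to save the conditional residuals and flag the extreme ones; the OUTP= dataset name and the 2.5 SD cutoff below are assumptions:
proc mixed data=XYZ_LONG;
   class group day subject;
   model TV = group day group*day / ddfm=satterthwaite outp=resid_out;
   repeated day / type=cs sub=subject;
run;

/* standardize the conditional residuals, then keep the extreme observations */
proc stdize data=resid_out out=resid_std method=std;
   var Resid;
run;

data outliers;
   set resid_std;
   if abs(Resid) > 2.5;   /* |standardized residual| beyond 2.5 */
run;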

Equivalent R^2 for Logit Regression in Stata

I am running Logit Regression in Stata.
How can I know the explanatory power of the regression (in OLS, I would look at R^2)?
Is there a meaningful approach to expanding the regression with other independent variables (in OLS, I manually keep adding independent variables and look at the adjusted R^2; my guess is that Stata has simplified this manual process)?
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden pseudo-R^2 in the Stata output altogether.
Hosmer and Lemeshow recommend, 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation', using the likelihood ratio test (G): G = D(model without the variables [B]) - D(model with the variables [A]).
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR-test p-value is > .05, do not reject H0, which implies that, statistically speaking, there is no advantage to including the additional IVs in the model.
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // LR test of the restricted model B (without IV2) against the full model A
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
David W. Hosmer and Stanley Lemeshow, Applied Logistic Regression, ISBN-13: 978-0471356325
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the amount of variance that your linear model predicts, which may or may not be an appropriate benchmark for your model.
Likewise, the presence or absence of an independent variable in your model requires substantive justification. If you want to look at how the R-squared changes when adding or removing parts of your model, see help nestreg for nested regression (a minimal sketch follows after the next paragraph).
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory on which to build your model.
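A minimal sketch of the nestreg approach mentioned above, reusing the hypothetical DV/IV names from the earlier example (each parenthesized block is entered in turn and a test of the added block is reported):
* DV on IV1 first, then IV2 added as a second block
nestreg: logit DV (IV1) (IV2)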
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just as you would after running OLS.
You might also want to read the likelihood ratio Chi-squared test or run additional lrtest commands as explained by Eric.
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are, however, ways to see how good a job your model does at predicting. For example, check out the following commands:
lroc
estat class
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic