I am using PROC GLIMMIX to analyze repeated measures data about specific sexual events. The original data came from a weekly diary study of about 400 people. Each week they reported on behaviours from their most recent sexual encounter. We also have baseline data on their demographics. Twelve weeks of observations were collected, and we had a high completion rate.
I would like to fit a mixed-effects model, but I am unsure exactly how this is done in SAS. I want to test the effect of event-specific factors as well as some person-level demographics, and I would like to get odds ratios for each factor of interest. The outcome is whether or not drugs were used during the event. The explanatory factors will be things like age, gender, etc., as well as characteristics of the event (e.g., partner HIV status, whether the partner was a regular sexual partner).
The code I'm working with follows this pattern:
PROC GLIMMIX DATA=work.dataset oddsratio;
  CLASS VISIT_NUM PARTICIPANT_ID BINARY_EVENTLEVEL_OUTCOME BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR;
  MODEL BINARY_EVENTLEVEL_OUTCOME = BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR / DIST=binary link=logit CL S ddfm=kr;
  RANDOM ?????;
RUN;
option 1 for ?????: _residual_ / subject=PARTICIPANT_ID
option 2 for ?????: INTERCEPT / subject=PARTICIPANT_ID
option 3 for ?????: VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1)
INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID)
option 4 for ?????: Other?
I am also unclear whether I should use ddfm=kr in my model statement or method=laplace in my proc statement -- both have been recommended elsewhere for this sort of repeated measures analysis.
I've come across several potential options for modelling this, and they often give similar results. However, option 1 gives a statistically significant result for an event-level factor, while the others give non-significant results. Including ddfm=kr makes the result of interest more significant. method=laplace does not allow option 1.
I may not be answering your question, but might be able to provide a couple of directions:
To start with the simplest part, your MODEL statement looks correct to me as you want to test event-level factors and person-level demographics which are thus considered as fixed effects.
Now, as far as the random effects are concerned:
the RANDOM statements you propose for options (1) and (2):
(1) RANDOM _residual_ / subject=PARTICIPANT_ID;
or
(2) RANDOM intercept / subject=PARTICIPANT_ID;
are modeling two different parts of the random effects: the R-side and the G-side, respectively.
If you are already familiar with PROC MIXED, note that your option (1), RANDOM _residual_ in PROC GLIMMIX, is equivalent to the REPEATED statement in PROC MIXED, which declares that you have repeated measures for PARTICIPANT_ID; that is clearly your case (Ref: "Comparing the GLIMMIX and MIXED Procedures").
As for option (3):
RANDOM VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1);
RANDOM INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID);
Here you are modeling the time component of the repeated measures (VISIT_NUM) as a random effect. This should be included when you believe that the response varies randomly at each of the measurement times (i.e., at each event). At first glance, I don't believe this is relevant in your case, since the fixed effects already take it into account, but of course I may be wrong without seeing your data.
Up to here is what I can contribute at this time.
As next steps for you to have a better understanding, I would suggest that you:
Read the Overview of the PROC GLIMMIX documentation, in particular the mathematical model specification and all 3 sections therein:
The Basic Model
G-Side and R-Side Random Effects and Covariance Structures
Relationship with Generalized Linear Models
If you are still unsure, ask your question at communities.sas.com which might be able to help you better.
HTH
I decided to post here a request for support that I posted on Statalist yesterday. I have not yet received any hints and thought it could be useful to widen the audience by posting it here.
The link to the original post is the following:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1659627-choose-the-appropriate-way-to-deal-with-weights-in-svyset?view=thread
Dear Members,
I designed a questionnaire to gather respondents' willingness to get vaccinated against COVID-19 via a discrete choice experiment. I relied on a company specialized in political opinion polls and market research to administer the survey. The company computed a weight for each respondent based on 1) the geographical location where the respondent lives (five macroareas of Italy), 2) whether or not the respondent has a bachelor's degree, and 3) the age group she/he belongs to (five classes are considered).
The sum of the weights is equal to the number of individuals in the database. Individuals in the age classes 30-39 and 40-49 are oversampled, at our request (related to a research hypothesis), so the proportion of these two classes in the sample is larger than their actual share of the Italian population. The weights are computed to account for this feature and to guarantee that the sample is representative of the characteristics of the Italian population.
I will use the data to estimate a logit model, multinomial logit models and mixed logit models.
The issue I am facing is how to properly declare the nature of the weight. I have no experience using Stata to deal with this issue.
I am using Stata 17 on a PC with Windows 10 Pro 64 bit.
Combining the information from the video, the svyset manual, and the results from the help for "weight", I tried to work out the most appropriate solution.
I also tried several times to add the code here, but I kept receiving an error message about how I had formatted it. My apologies.
Consider a search system where the user submits a query ‘query’ and retrieves products based on some ranking algorithm. Assume that these products are ordered by quality as p_0, p_1, …, p_10, and so on.
I would like to generate vector embeddings that mimic this ranking algorithm. The closest product vector to a query vector should ideally be p_0, the next one should be p_1 and so on.
I have tried building word2vec embeddings for products by feeding in, as sentences, the products that appeared in the same search session. Then I calculated query vectors as a weighted average of product vectors, so that the query vector is closer to the top result. Although the closest result is usually the best result for a given query, the subsequent results include some that would never appear as a top result.
Is there a trick that lets word2vec learn the ranking algorithm, or are there other techniques I can try? I have looked into multi-dimensional scaling with non-metric distances, but it did not seem scalable to hundreds of thousands of products.
There's no one trick – just iteratively improving your representations, & training set, & ranking methods to better meet your goals.
Word2vec-based representations can often help, but are still fairly simple & centered on individual words – whose senses may vary based on context & position in ways that a simple weighted-average-of-tokens fails to capture.
You may want to represent 'products' by more than just a string-of-word-tokens – to include other properties, as well. These could be scalar values like prices or various other kinds of ratings/properties, or extra synthetic labels, such as the result of other salient groupings (whether hand-edited or learned).
And even if just working with natural-language product descriptions – like product names, or descriptions, or reviews – there are other more-sophisticated text-representations that can be trained or used – such as sentence/document embeddings using deeper-networks than plain word2vec.
Most generically, if you have a bunch of quantitative representations of candidate results, and a query, and want to use some initial examples of "good" results to bootstrap more generalizable rules for scoring top results, you are attempting a "learning-to-rank" process:
https://en.wikipedia.org/wiki/Learning_to_rank
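As a rough illustration of what a pairwise learning-to-rank step can look like, here is a minimal sketch in plain Python. Everything in it is assumed for illustration: the two-dimensional product features, the product ids, and the helper names `train_pairwise_ranker` and `score` are hypothetical, and the hinge-style update is just one simple choice among many ranking losses. It is not the asker's ranking algorithm.

```python
import random


def score(w, x):
    """Linear relevance score: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_ranker(features, ranked_ids, epochs=200, lr=0.1, seed=0):
    """Learn weights so that score(better) exceeds score(worse) by a margin.

    features   : dict mapping product id -> list of numeric features (hypothetical)
    ranked_ids : product ids in the desired order, best first
    """
    rng = random.Random(seed)
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    # Every (better, worse) pair implied by the desired ranking.
    pairs = [
        (ranked_ids[i], ranked_ids[j])
        for i in range(len(ranked_ids))
        for j in range(i + 1, len(ranked_ids))
    ]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [a - b for a, b in zip(features[better], features[worse])]
            # Hinge-style rule: update only if the pair is mis-ordered or too close.
            if score(w, diff) < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


# Toy data: four products with made-up features, listed in the desired order.
features = {
    "p_0": [0.9, 0.8],
    "p_1": [0.7, 0.6],
    "p_2": [0.4, 0.5],
    "p_3": [0.1, 0.2],
}
ranked_ids = ["p_0", "p_1", "p_2", "p_3"]

w = train_pairwise_ranker(features, ranked_ids)
predicted = sorted(features, key=lambda pid: score(w, features[pid]), reverse=True)
print(predicted)
```

In a real system the features would presumably combine query and product information (for example, embedding similarities plus price or rating signals), and the training pairs would come from logged rankings rather than a hand-written list.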
To suggest more specific steps would require a more specific description of inputs/outputs/goals, & what's been tried, and how what's been tried has failed.
For example, are your queries always just textual product names? In such a case, maybe plain keyword search is the central technology required, with things like word-vector-modelling just a tweak for handling some tough cases, like expanding the results, or adding more contrast to the rankings, when results are too few or too many.
Or, can you detect key gaps in the modeling related to exactly those cases where "results include some results that would [ideally] never appear as a top result"? If certain things like rare (poorly-modeled) words, or important qualities not yet captured in the model, seem to be to blame for such cases, that will guide the potential set of corrective changes.
(cross-posted at http://www.statalist.org/forums/forum/general-stata-discussion/general/1370770-margins-plot-of-treatment-effect-rather-than-y-for-values-of-a-covariate)
I'm running a multivariate regression (outcome variable is continuous, happens to be GPA). The covariate of interest is a dummy variable for treatment status; another of the covariates is a pre-score. We want to look at how the treatment effect differs at various values of pre-score. The structure of the model is not complicated:
regress GPA treatment pre_score X3 X4 X5...
What I want is a graph that shows what the treatment effect is (values of Beta1) at various values of pre-score (X2). It's straightforward to get a graph with values of the OUTCOME at various values of X2:
margins, at(pre_score= (1(0.25)5)) post
marginsplot
I have consulted an array of resources and tried alternatives using marginscontplot, coefplot with recast, the dy/dx option, and so forth. I remain unsuccessful. But this seems like something that there must be a way to do; wanting to know if a treatment effect varies for values of a control (say, income) must be common.
Can anyone direct me to the right command, or options for Margins, to output values of Beta1 (coefficient on treatment dummy), rather than of Y (GPA), at values of the pre_score?
Question was resolved at Statalist. Turns out that Margins alone can't do what I was trying to; the model needs to be run with an interaction term. Then it's simple.
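For later readers, the reason an interaction term is needed can be sketched as follows. The variable names come from the regression above; the coefficient labels are my own notation, not taken from the Statalist thread.

```latex
\begin{aligned}
\text{GPA} &= \beta_0 + \beta_1\,\text{treatment} + \beta_2\,\text{pre\_score}
              + \beta_3\,(\text{treatment}\times\text{pre\_score}) + \dots + \varepsilon \\[4pt]
E[\text{GPA}\mid \text{treatment}=1] - E[\text{GPA}\mid \text{treatment}=0]
           &= \beta_1 + \beta_3\,\text{pre\_score}
\end{aligned}
```

Without the interaction term, \beta_3 is fixed at zero, so the estimated treatment effect is the same constant \beta_1 at every value of pre_score and there is nothing for the plot to show.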
I have a classroom of students with test scores taken weekly. I expect the test results to improve over time. I want to identify a poor performer as an outlier based on not improving over time using SAS (have 9.2). Also are there accepted criteria for being an outlier for part of the time interval but not the complete time interval? This is the bulk of my present code (not looking for outliers yet, just longitudinal analysis):
proc mixed data=XYZ_LONG;
  title1 'XYZ Analysis';
  class group day subject;
  model TV = group day group*day / ddfm=satterthwaite;
  repeated day / type=cs sub=subject;
run;
Your definition of "poor performer" is not a definition of outlier, I don't think. However:
If you want to find people who did not improve over time, that's pretty easy, but you have to define it more precisely. Did not improve between any two weeks? The first and last weeks? Something else?
And what do you mean by "not improve" exactly? Do you mean it literally (the same or a worse score at a later time)?
In either case, I'd use an array and find difference scores and then identify difference scores that were negative (or whatever you want).
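The suggestion above is for a SAS array in a data step. As a language-neutral sketch of the same logic, here is a short Python version. The student ids, the scores, and the two rule names (`first_last`, `any_drop`) are illustrative assumptions, not anything from the original data.

```python
def difference_scores(scores):
    """Week-to-week differences: later score minus earlier score."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def flag_non_improvers(weekly_scores, rule="first_last"):
    """Return the students who did not improve under the chosen definition.

    rule="first_last": last score is not higher than the first score
    rule="any_drop"  : at least one week-to-week difference is zero or negative
    """
    flagged = []
    for student, scores in weekly_scores.items():
        diffs = difference_scores(scores)
        if rule == "first_last":
            not_improved = scores[-1] - scores[0] <= 0
        elif rule == "any_drop":
            not_improved = any(d <= 0 for d in diffs)
        else:
            raise ValueError(f"unknown rule: {rule}")
        if not_improved:
            flagged.append(student)
    return flagged


# Made-up weekly test scores, one list per student.
weekly_scores = {
    "s1": [60, 65, 70, 78],
    "s2": [72, 70, 75, 80],
    "s3": [68, 66, 67, 64],
}

first_last_flags = flag_non_improvers(weekly_scores, rule="first_last")
any_drop_flags = flag_non_improvers(weekly_scores, rule="any_drop")
print(first_last_flags, any_drop_flags)
```

The two rules give different answers for the same data, which is why the definition of "did not improve" has to be settled first.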
However, if you are going to be doing modelling, then an outlier should probably be defined in terms of that model - that is, in your model, accounting for group. But if you have a lot of outliers and they aren't bad data, you should not throw out those people, but use a better model.
I'm trying to save output from several hundred eststo's, storing results of bivariate probability models, into one Excel file using esttab. It works for xtlogit (both ,re and ,pa), xtprobit (both ,re and ,pa) and for the linear probability model xtreg (both standard and ,fe). However, when I use xtreg y x i.year, fe I get the error message "too many base levels specified". Google doesn't help me much.
I've been trying for an hour to create a reproducible example, but the Stata datasets all work fine. It does not seem to be due to the number of years or the fact that different specifications have data for different years. The plain xtreg, fe still works; the problem only appears with time dummies. The weirdest thing is that it works for all subsets of my variables but not for the whole list (again, only in the time fixed effects specifications).
Does anyone have an idea how to proceed? Using drop(*.year) works whenever the problem does not arise (so in specifications where it works, I get outputs without the year dummies) but does not prevent the too many base levels specified error; ,nobaselevels has no apparent effect as well. Is there a way to remove the time fixed effects from eststo before I pass those on to esttab? Any workaround would be appreciated as well.
The problem you might be facing is that of Stata creating different base levels for the factor variable year, in different regressions.
Try fixing the factor variable base level beforehand with fvset:
fvset base <some_number> year
Check help fvset and the manual entry for details. Also, read the source given below, which contains more information.
Source: two posts from Statalist; one from Tim Wade and another by Jeff Pitblado.