I wonder if there is a way to model this following model in SAS (preferably with some built-in functions):
Y_{ij} = \beta_{i}(t_{j}) X_{i} + e_{ij},
which basically is a form of functional mixed-effects model without the random-effects parts. Here each subject i has observations at time t_{j} (j=1,...,n_{i}), and X_{i} is the covariates not varying with time within subjects but varying between subjects. The random errors follow iid normal distribution.
The closest form of model I found so far is the functional-coefficient model which normally is used for time-series data, although I'm not sure if it's applicable to time-invariant covariates. Anyone knows if there is a way to model this using some exiting software (preferrably SAS)?
Related
I decided to post here a kind information for support I put in Statalist yesterday. I have not yet received a possible hint and thought it could be useful to extend the audience by posting it here.
The link to the original post is the following:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1659627-choose-the-appropriate-way-to-deal-with-weights-in-svyset?view=thread
Dear Members,
I defined a questionnaire to gather respondents' willingness to get vaccinated against COVID-19 via a discrete choice experiment. I relied on a company specialized in political opinion polls and market research to administer the survey. The company computed a weight for each respondent based on 1) the geographical location where the respondent lives (five macroareas of Italy), 2) whether the respondent has a bachelor degree or not, and 3) to which age group she/he pertains (five classes are considered).
The sum of the weights is equal to the number of individuals in the database. The individuals pertaining to the age classes 30-39 and 40-49 are oversampled, as per our request (related to a research hypothesis). The proportion of such two classes within the sample is larger than the actual in the Italian population. Weights are computed in order to take into account for this feature and guarantee that the sample is representative of the characteristics of the Italian population.
I will use the data to estimate a logit model, multinomial logit models and mixed logit models.
The issue I am facing with is the proper path to follow to declare the nature of the weight. I have no experience in the use of Stata to deal with this issue.
I am using Stata 17 on a PC with Windows 10 Pro 64 bit.
Combining the information from the video, the svysvyset manual and the results from the help for "weight" I tried to think what is the most appropriate solution.
I tried to add here the code multiple times as well but I kept receiving an error message on how I formatted it. My apologies
I am using PROC GLIMMIX to analyze repeated measures data about specific sexual events. The original data came from a weekly diary study of about 400 people. During each week they reported on behaviours from their most recent sexual encounter. We also have basline data on their demographics. 12 weeks of observation were collected and we had a high completion rate.
I would like to create a mixed effect model, but I am unsure exactly how this is done in SAS. I want to test the effect of event-specific factors as well as some person level demographics and would like to get odds ratios for each factor of interest. The outcome is whether or not drugs were used during the event and the explanatory factors will be things like age, gender, etc. as well as characteristics about the event (i.e., partner HIV status), whether the partner was a regular sexual partner, etc..
The code I'm working with follows this pattern:
PROC GLIMMIX DATA=work.dataset oddsratio;
CLASS VISIT_NUMBER PARTICIPANT_ID BINARY_EVENTLEVEL_OUTCOME BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR;
MODEL BINARY_EVENTLEVEL_OUTCOME = BINARY_EVENTLEVEL_EXPLANATORY CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR /DIST=binary link=logit CL S ddfm=kr;
RANDOM ?????;
RUN;
option 1 for ?????: residual / subject=PARTICIPANT_ID
option 2 for ?????: INTERCEPT / subject=PARTICIPANT_ID
option 3 for ?????: VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1)
INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID)
option 4 for ?????: Other?
I am also unclear whether I should use ddfm=kr in my model statement or method=laplace in my proc statement -- both have been recommended elsewhere for this sort of repeated measures analysis.
I've come across several potential options for modelling this which often give similar results, but option 1 gives a statistically significant result for an event-level, while the others give non-significant results. The inclusion of the ddfm=kr makes the result of interest more significant. The method=laplace does not allow for option 1.
I may not be answering your question, but might be able to provide a couple of directions:
To start with the simplest part, your MODEL statement looks correct to me as you want to test event-level factors and person-level demographics which are thus considered as fixed effects.
Now, as far as the random effects are concerned:
the RANDOM statements you propose for options (1) and (2):
(1) RANDOM _residual_ / subject=PARTICIPANT_ID;
or
(2) RANDOM intercept / subject=PARTICIPANT_ID;
are modeling two different parts of the random effects: the R-side and the G-side, respectively.
If you are already familiar with PROC MIXED, you may want to notice that your option (1) of using RANDOM _residual_ in PROC GLIMMIX is equivalent to using the REPEATED statement in PROC MIXED that tells that you have repeated measures for PARTICIPANT_ID, which is clearly your case (Ref: "Comparing the GLIMMIX and MIXED Procedures")
As for option (3):
RANDOM VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1) INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID);
here you are modeling the time component of the repeated measures (visit_num) as a random effect, and this should be included when you believe that there would be a random variation of the response at each of the measurements times (i.e. at each event). At first glance, I don't believe this is relevant in your case, since you are taking this into account already by the fixed effects... but of course I may be wrong by not seeing your data.
Up to here is what I can contribute at this time.
As next steps for you to have a better understanding, I would suggest that you:
Read the Overview of the PROC GLIMMIX documentation, in particular the mathematical model specification and all 3 sections therein:
The Basic Model
G-Side and R-Side Random Effects and Covariance Structures
Relationship with Generalized Linear Models
If you are still unsure, ask your question at communities.sas.com which might be able to help you better.
HTH
After I have trained a model, how do I use it with C++?
I have tried MXNet incubator-mxnet/example/image-classification/predict-cpp/
and incubator-mxnet/cpp-package/example/.
As part of training you should periodically evaluate your model against a validation set, at the end of each epoch for example. You should then have a good idea of the expected accuracy of the model when using the model to score new data, to determine if the model is really performing worse than expected at inference time.
If the validation accuracy of the model while training the model is no better than random (i.e. 1/number of classes), there could be many reasons for this including; poor model selection, incorrect loss calculation, wrong optimization technique and hyperparameters (e.g. learning rate).
If the test accuracy of the model on unseen data is poor, you might be trying to apply the model to a different domain to which it was trained. You can't use a model trained on handwritten characters (e.g. MNIST) to classify real world objects (e.g. ImageNet).
If you need a C++ example of model training, take a look at this tutorial.
i'm using weka to do some text mining, i'm a little bit confused so i'm here to ask how can i ( with a set of comments that are in a some way classified as: notes, status of work, not conformity, warning) predict if a new comment belong to a specific class, with all the comment (9551) i've done a preprocess obtaining with the filter "stringtowordvector" a vector of tokens, and then i've used the simple kmeans to obtain a number of cluster.
So the question is: if a user post a new comment can i predict with those data if it belong to a category of comment?
sorry if my question is a little bit confused but so am i.
thank you
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved on your validation set. Record the error. What is the difference between training error and validation error? If they both are low, the model's generalization is "seemingly" good.
Prepare a test dataset where you have all the features of your training and test dataset but the class/cluster is unknown.
Apply the model on the test data.
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors is low? If yes, then save the model and apply it on the test data whose class/cluster is unknown.
NB: The training/test/validation errors and their differences will give you an "very initial" idea of overfitting/underfitting of your model. They are sanity tests. You need to perform other tests like learning curves to see if your model overfits or underfits or perfect. If there appears to be an overfitting and underfitting problem, you need to try many different techniques to overcome them.
I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.