Suppose I have a regression where the response variable is sales and the independent variables are various drivers of sales. I want to build the model using MCMC, but I am unsure whether that is even feasible (I am running it in SAS). Below is a simplified model structure (the production model has many more variables and random interactions):
$$Y_{ij} = \beta_0 + \beta_{1\text{TV}} X_{1ij} + \gamma_{(\text{TV} \times \text{dma})i} + \varepsilon_{ij}$$
For the model above, I have one main effect for TV, represented by $\beta_1$, and a random interaction between DMA (there are 210 DMAs in the US) and TV, represented by $\gamma$. I have priors for all my parameters, and when I run PROC MCMC in SAS it takes hours to finish. Can MCMC handle 210 levels of the random interaction term? I am using MCMC because I want to use the prior knowledge from previous modeling rounds, but that makes no sense if the model takes forever to run.
proc mcmc data=modeldbsubset outpost=postout thin=1000 nmc=20000 seed=7893
          monitor=(b0 b1);
   ods select PostSummaries PostIntervals tadpanel;
   parms b0 0 b1 0;
   parms s2 1;
   parms s2g 1;
   prior b: ~ normal(0, var = 10000);
   prior s2: ~ igamma(0.001, scale = 1000);
   random gamma ~ normal(0, var = s2g) subject = dmanum monitor = (gamma) namesuffix = position;
   mu = b0 + b1*TV + gamma;
   model Y ~ normal(mu, var = s2);
run;
I don't use SAS, but it's no surprise that a model of this scale would struggle badly with the default random walk Metropolis sampler, which initializes the proposal distribution with an identity covariance matrix. The documentation on scale tuning says you can tune the proposal to a MAP estimate of the covariance (this is what PyMC3 does by default), so maybe try starting there. However, the docs also say that doing so will then use the MAP for parameter initialization, which is a bad idea, since in high dimensions the MAP is usually not in the typical set.
In the end, I expect you’ll need to do a lot of tuning specific to your data to really get it running cleanly, and unfortunately that’s just part of the art.
Alternatively, you might be better off picking up a more advanced MCMC sampling framework that implements HMC/NUTS, such as Stan, PyMC3, or Edward. There are also higher-level packages, like RStanArm, built specifically for Bayesian regression modeling, which keep the lower-level MCMC machinery in the background.
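For a rough sense of what that would look like, here is a minimal PyMC3 sketch that mirrors the mu = b0 + b1*TV + gamma structure of your PROC MCMC program and samples it with NUTS. Everything here is an assumption on my part: the fake data block stands in for your sales, TV, and DMA columns, and the weakly informative priors stand in for the informative priors from your previous rounds.

import numpy as np
import pymc3 as pm

# Fake data purely so the sketch runs; replace with your real sales, TV and DMA columns
rng = np.random.default_rng(0)
n_obs, n_dma = 5000, 210
dma_idx = rng.integers(0, n_dma, size=n_obs)          # DMA index coded 0..209
tv = rng.gamma(2.0, 10.0, size=n_obs)                 # TV spend
y = 5 + 0.3 * tv + rng.normal(0.0, 1.0, size=n_obs)   # sales

with pm.Model() as sales_model:
    # Weakly informative stand-ins; swap in your priors from earlier modeling rounds
    b0 = pm.Normal("b0", mu=0.0, sigma=100.0)
    b1 = pm.Normal("b1", mu=0.0, sigma=100.0)
    sigma_g = pm.HalfNormal("sigma_g", sigma=10.0)    # sd of the DMA random effects
    sigma = pm.HalfNormal("sigma", sigma=10.0)        # residual sd

    # One random effect per DMA (210 levels), analogous to the RANDOM statement
    gamma = pm.Normal("gamma", mu=0.0, sigma=sigma_g, shape=n_dma)

    mu = b0 + b1 * tv + gamma[dma_idx]
    pm.Normal("sales", mu=mu, sigma=sigma, observed=y)

    # NUTS is the default sampler for continuous parameters
    trace = pm.sample(2000, tune=2000)

A 210-level random effect is a small model by HMC standards, so sampling should be far faster than what you are seeing, although that obviously depends on the size of your data. If you would rather not code the model by hand, RStanArm's stan_glmer accepts essentially the same structure as a mixed-model formula with a random term for DMA.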
Related
I want to run a GEE for clustered data - I am trying to get incidence rate ratios (IRR) for antibiotic reactions between two drugs. I have searched for information on constructing GEE models (GENMOD in SAS, xtgee in Stata) but I can't find criteria on what type of variables can be included as covariates. My model is this:
proc genmod data = mydata;
class Pt fev1_cat;
model rate_pip = cumulative_dose_before fev1_cat Average_Dose_Admis mero_rate /
type3 dist=poisson link=log;
repeated subject=Pt;
run;
rate_pip is the rate of adverse events (AE) for the antibiotic in question; mero_rate is the AE rate for a different antibiotic. The other variables are either categorical or continuous.
If I adjust the GEE for a covariate that is itself a rate, 1) is that a correct use of the GEE model, and 2) would exp(coef) be interpreted as the IRR between the two AE rates, or as: for each unit increase in mero_rate, the rate of rate_pip is x times higher/lower?
I can't say whether this is a correct use of a GEE model without knowing a little more about the data structure, but I don't know of anything special about GEE models that would preclude the use of rate variables as predictors (as compared to, say, an ordinary least squares regression model). If the model is okay without the mero_rate predictor, it would probably be okay with it too. The one caveat is that it shouldn't be too strongly correlated with the other predictors.
As far as interpretation goes, I think you've pretty much got it: the log of the incidence rate increases by beta for a mero_rate value of x+1 events per unit time compared to a mero_rate value of x events per unit time, all other things equal.
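Put as a formula (my notation, with $\beta_{\text{mero}}$ for the coefficient on mero_rate):

$$\frac{E[\text{rate\_pip} \mid \text{mero\_rate} = x+1]}{E[\text{rate\_pip} \mid \text{mero\_rate} = x]} = \exp(\beta_{\text{mero}}),$$

so $\exp(\beta_{\text{mero}})$ is the multiplicative change in the expected value of rate_pip per one-unit increase in mero_rate, holding the other covariates fixed. It is an IRR with respect to a one-unit change in mero_rate, not a head-to-head comparison of the two drugs.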
I am trying to predict whether a particular service ticket raised by a client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on the test set, while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.ensemble.RandomForestClassifier(n_jobs=3, n_estimators=100, class_weight='balanced', max_features=None, oob_score=True)
Please help!
EDIT:
I have 11k training examples with 900 positives (skewed). I tried the sparse LinearSVC feature selection, but it didn't work as well as Truncated SVD (latent semantic indexing). max_features=None performs better on the test set than the default setting.
I have also tried SVM, logistic regression (L2 and L1), and ExtraTrees. RandomForest is still working best.
Right now I am at 92% precision on positives, but recall is only 3%.
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (length in characters, length in words, their difference and ratio, day of week the problem was reported, day of month, etc.) and now I am at 19-20% recall with >95% accuracy.
Food for thought: what about using averaged word2vec vectors as features for the free text instead of TF-IDF or bag of words? (A rough sketch of what I mean is below.)
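To make the idea concrete, here is a rough sketch of the averaging I have in mind, assuming the gensim 4.x Word2Vec API; the tiny corpus is only a placeholder for the real tokenized problem descriptions.

import numpy as np
from gensim.models import Word2Vec

# Placeholder corpus purely so the sketch runs; use the tokenized problem descriptions instead
tokenized_docs = [
    ["server", "timeout", "error"],
    ["password", "reset", "request"],
    ["deploy", "failed", "code", "change"],
] * 100

w2v = Word2Vec(tokenized_docs, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens, model):
    # Average the vectors of the tokens that made it into the vocabulary
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.wv.vector_size)
    return np.mean(vecs, axis=0)

X_w2v = np.vstack([doc_vector(doc, w2v) for doc in tokenized_docs])
# X_w2v is a dense (n_documents, 100) matrix that any sklearn classifier can consume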
Random forest handles having more features than data points quite well. RF is used, for example, in microarray studies with data-point:feature ratios around 100:5,000, or in single-nucleotide polymorphism (SNP) studies with ratios around 5,000:500,000.
I disagree with the diagnosis provided by @ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is simply not meaningful to pay attention to the non-cross-validated training-set prediction performance of an RF model, because any sample ends up in terminal nodes/leaves that it has itself helped define. The overall ensemble model is still robust.
[edit] If you changed max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross-validated training error/precision of a random forest model, or of many other ensemble models, simply does not estimate anything useful.
[Before the edit I confused max_features with n_estimators; sorry, I mostly use R.]
Setting max_features=None is not a random forest but rather 'bagged trees'. You may benefit from a somewhat lower max_features, which improves regularization and speed, or maybe not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test-set prediction performance with feature selection. You can run one RF model, keep the top ~5-50% most important features, and then re-run the model with only those features (a sketch follows below). 'L1 lasso' variable selection, as ncfirth suggests, may also be a viable solution.
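A minimal sklearn sketch of that importance-based selection step; the random sparse matrix and labels below are only placeholders for the real 17k x 27k TF-IDF matrix and the Y/N tags:

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the TF-IDF matrix and the binary labels
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
y = np.random.RandomState(0).randint(0, 2, 1000)

# First pass: fit an RF only to obtain feature importances
rf_full = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=3)
rf_full.fit(X, y)

# Keep roughly the top 10% most important features (tune this fraction, e.g. 5-50%)
n_keep = int(0.10 * X.shape[1])
top_idx = np.argsort(rf_full.feature_importances_)[::-1][:n_keep]
X_reduced = X[:, top_idx]

# Second pass: refit on the reduced feature set and evaluate on held-out data
rf_small = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=3)
rf_small.fit(X_reduced, y)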
Your metric of prediction performance, precision, may not be optimal for unbalanced data or when the costs of false negatives and false positives are quite different.
If the test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with the i.i.d. assumptions that any supervised ML model relies on, or you may need to wrap the entire data processing in an outer cross-validation loop to avoid over-optimistic estimates of prediction performance due to, e.g., the variable selection step (see the sketch below).
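One convenient way to build that outer loop in sklearn is to put the selection step and the classifier into a single Pipeline, so the selection is refit inside every fold rather than on the full data; a sketch, reusing the placeholder X and y from the snippet above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Feature selection lives inside the pipeline, so it is refit within every CV fold
pipe = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=3),
        threshold="median")),  # keep the more important half of the features
    ("rf", RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=3)),
])

# The outer cross-validation then gives an honest estimate of out-of-sample recall
scores = cross_val_score(pipe, X, y, cv=5, scoring="recall")
print(scores.mean())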
It seems like you've overfit on your training set. Basically, the model has learnt the noise in the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that your model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, though this may still be the case (left as an exercise for the reader!). Either way, feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np

# Create some random data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800) > 0.5

# An L1-penalised linear SVM drives many coefficients to exactly zero
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
# Keep only the features with non-zero coefficients
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
This returns a new matrix of shape (1800, 640). You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn, but sometimes called the sparsity parameter).
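For instance, a quick way to see how C controls the number of surviving features, reusing X and y from the snippet above:

from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# Smaller C means a stronger L1 penalty, so fewer features are kept
for C in (0.01, 0.05, 0.1, 0.5):
    lsvc = LinearSVC(C=C, penalty="l1", dual=False).fit(X, y)
    n_kept = SelectFromModel(lsvc, prefit=True).transform(X).shape[1]
    print(C, n_kept)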
I've got a large do-file that calls several sub-do-files, all in the lead-up to the estimation of a custom maximum likelihood model. That is, I have a main.do, which looks like this
version 12
set seed 42
do prepare_data
* some other stuff
do estimate_ml
and estimate_ml.do looks like this
* lots of other stuff
global cdf "normal"
program define customML
args lnf r noise
tempvar prob1l prob2l prob1r prob2r y1l y2l y1r y2r euL euR euDiff scale
quietly {
generate double `prob1l' = $ML_y2
generate double `prob2l' = $ML_y3
generate double `prob1r' = $ML_y4
generate double `prob2r' = $ML_y5
generate double `scale' = 1/100
generate double `y1l' = `scale'*((($ML_y10+$ML_y6)^(1-`r'))/(1-`r'))
generate double `y2l' = `scale'*((($ML_y10+$ML_y7)^(1-`r'))/(1-`r'))
generate double `y1r' = `scale'*((($ML_y10+$ML_y8)^(1-`r'))/(1-`r'))
generate double `y2r' = `scale'*((($ML_y10+$ML_y9)^(1-`r'))/(1-`r'))
generate double `euL' = (`prob1l'*`y1l')+(`prob2l'*`y2l')
generate double `euR' = (`prob1r'*`y1r')+(`prob2r'*`y2r')
generate double `euDiff' = (`euR'-`euL')/`noise'
replace `lnf' = ln($cdf( `euDiff')) if $ML_y1==1
replace `lnf' = ln($cdf(-`euDiff')) if $ML_y1==0
}
end
ml model lf customML ... , maximize technique(nr) difficult cluster(id)
ml display
To my great surprise, when I run the whole thing from top to bottom in Stata 12/SE I get different results for one of the coefficients reported by ml display each time I run it.
At first I thought this was a problem of running the same code on different computers, but the issue occurs even if I run the same code on the same machine multiple times. Then I thought it was a random number generator issue, but, as you can see, I can reproduce the problem even if I fix the seed at the beginning of the main do-file. The same holds when I move the set seed command to immediately above the ml model call. The only way to get the same results through multiple runs is to run everything above ml model once and then run only ml model and ml display repeatedly.
I know that the likelihood function is very flat in the direction of the parameter whose value changes over runs so it's no surprise it can change. But I don't understand why it would, given that there seems to be little that isn't deterministic in my do files to begin with and nothing that couldn't be made deterministic by fixing the seed.
I suspect a problem with sorting. The default behaviour is that if two observations have the same value of the sort variable, their relative order after sorting is random. Moreover, the random process that drives this sorting is governed by a different seed. This is intentional, as it prevents users from accidentally seeing consistency where none exists; the logic is that it is better to be puzzled than to be overly confident.
As someone mentioned in the comments to this answer, adding the option stable to my sort command made the difference in my situation.
I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own command, you can create new variables containing these scores using the ml score command. Official Stata commands and most polished user-written commands often have score as an option for predict, which does the same thing with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$, you just multiply the score returned by Stata by $x_1$.
I am running Logit Regression in Stata.
How can I know the explanatory power of the regression (in OLS, I look at R^2)?
Is there a meaningful approach to expanding the regression with other independent variables? (In OLS, I manually keep adding independent variables and look at the adjusted R^2; my guess is that Stata has a way to simplify this manual process.)
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden Pseudo R2 in the Stata output altogether.
Lemeshow recommends 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation', using the likelihood ratio test statistic G: G = D(model without the variables [B]) - D(model with the variables [A]).
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR test gives p > .05, do not reject H0, which implies that, statistically speaking, there is no advantage to including the additional IVs in the model.
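In terms of the maximized log likelihoods of the two fitted models (my notation, with $L_A$ for the full model and $L_B$ for the restricted one):

$$G = D_B - D_A = -2\left(\ln L_B - \ln L_A\right) \sim \chi^2_{k} \quad \text{under } H_0,$$

where $k$ is the number of parameters dropped from model A to obtain model B.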
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // likelihood-ratio test of the restricted model B against the full model A
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
David W. Hosmer and Stanley Lemeshow, Applied Logistic Regression, ISBN-13: 978-0471356325
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the proportion of variance that your linear model accounts for, which may or may not be an appropriate benchmark for your model.
Likewise, the presence or absence of an independent variable in your model requires substantive justification. If you want to see how the R-squared changes when adding or removing parts of your model, see help nestreg for nested regression.
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory on which to build your model.
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just as you would after running OLS.
You might also want to read about the likelihood-ratio chi-squared test, or run additional lrtest commands as explained by Eric.
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are ways to see how good a job your model does at predicting, though. For example, check out the following commands:
lroc
estat class
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic