Multivariate nonlinear regression - C++

I am new to machine learning and want to implement multivariate nonlinear regression, but I can't seem to find a good C++ library for it.
For example, the data looks like this:
y x1 x2 x3 x4 x5 x6
4.52e+005 8.32e+000 4.10e+001 8.801e+002 1.29e+002 3.22e+002 1.26e+002
3.585e+005 8.30e+000 2.10e+001 7.099e+003 1.10e+003 2.40e+003 1.13e+003
3.521e+005 7.25e+000 5.20e+001 1.467e+003 1.91e+002 4.96e+002 1.77e+002
3.413e+005 5.64e+000 5.20e+001 1.274e+003 2.35e+002 5.58e+002 2.19e+002
3.422e+005 3.84e+000 5.20e+001 1.627e+003 2.81e+002 5.65e+002 2.59e+002
I need a nonlinear regression model to predict the output y for given input variables (x1, x2, x3, x4, x5, x6).

I would try out kernel ridge regression and/or support vector regression on this. Either of them will probably work quite well.
The dlib C++ library has easy-to-use implementations of both of these methods. See the support vector regression or kernel ridge regression example programs for the details. Note that these examples use just one input variable, but all you need to do is change the dimensionality of the input vector to something other than 1. So in the examples, you just change the line
typedef matrix<double,1,1> sample_type;
to
typedef matrix<double,6,1> sample_type;
and then they will work on 6 input variables.
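For intuition about what those trainers compute, kernel ridge regression is only a few lines of linear algebra. The sketch below is Python/NumPy rather than C++, purely to illustrate the math on 6-dimensional inputs; the data, the RBF width gamma, and the ridge parameter lam are all made up for the example:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=1.0, gamma=0.1):
    """Kernel ridge regression: solve (K + lam*I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=0.1):
    """A prediction is a kernel-weighted sum over the training points."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Six input variables, as in the question (random stand-in data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=50)

alpha = krr_fit(X, y)
y_hat = krr_predict(X, alpha, X)   # in-sample fit
```

In practice you would tune gamma and lam by cross-validation, which is exactly what the dlib examples walk through.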

Related

Linear regression - testing for multicollinearity in the presence of heteroskedasticity in EViews

My question concerns the use of the VIF test for multicollinearity diagnostics when the model suffers from heteroskedasticity.
I want to use a HAC correction to account for heteroskedasticity in my model. However, VIF gives me starkly different results depending on whether I run it after estimating the model with simple OLS without error correction, or start with the HAC-corrected regression and then run VIF. I use EViews.
This surprised me, since the VIF test statistic is just 1/(1-R^2), where R^2 is calculated for a model in which a given x_i variable is regressed on the rest of the X variables. This implies that the result should not depend on the standard errors of the estimated parameters in the original y-on-X regression, and thus should not depend on whether I use robust errors or not.
However, EViews calculates VIF differently, using estimates of the parameters' standard errors (tutorial, p. 198 of the pdf). While it is suggested that both approaches are equivalent, that is clearly not the case in my example.
In short, in which order should I proceed: first test for multicollinearity with the simple OLS model and then move on to the model with HAC, or the other way around, estimating the model with HAC and then running VIF?
Thanks for all your help!
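As a sanity check on the reasoning above, the textbook VIF really does depend only on the X matrix. The sketch below (Python/NumPy with made-up data; nothing EViews-specific) regresses each x_i on the remaining regressors and returns 1/(1-R^2), with no reference to the standard errors of the original y-on-X regression:

```python
import numpy as np

def vif(X):
    """Textbook VIF: for each column j, regress x_j on the other
    columns (plus a constant) and return 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1.0 - resid.var() / xj.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
z = rng.normal(size=200)
X = np.column_stack([
    z + 0.1 * rng.normal(size=200),   # x1, highly collinear with x2
    z + 0.1 * rng.normal(size=200),   # x2
    rng.normal(size=200),             # x3, roughly independent
])
print(vif(X))   # large VIFs for x1 and x2, near 1 for x3
```

Since y never appears in the calculation, this version of VIF cannot change when you switch between ordinary and HAC standard errors; any difference you see must come from the alternative, standard-error-based formula EViews uses.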

How can you remove only the interaction terms in a polynomial regression using scikit-learn?

I am running a polynomial regression using scikit-learn. I have a large number of variables (23 to be precise) which I am trying to regress using polynomial regression with degree 2.
Setting interaction_only=True keeps only the interaction terms, such as X1*Y1, X2*Y2, and so on.
I want only the other terms, i.e. X1, X1^2, Y1, Y1^2, and so on.
Is there a function to get this?
There is no such function, because the transformation can easily be expressed with numpy itself:
X = ...
new_X = np.hstack((X, X**2))
and analogously if you want to add everything up to degree k
new_X = np.hstack([X**(i+1) for i in range(k)])
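As a complete, runnable version of the idea above (NumPy only; the data is made up), the transform turns an n-by-p matrix into an n-by-(p*k) matrix containing x, x^2, ..., x^k for every column and no cross terms:

```python
import numpy as np

# Three input columns, five rows of made-up data.
X = np.arange(15, dtype=float).reshape(5, 3)

k = 2  # highest power to include
new_X = np.hstack([X ** (i + 1) for i in range(k)])

print(new_X.shape)   # (5, 6): each of the 3 columns plus its square
```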
I know this thread is super old, but for folks like me who are just getting started, patsy is another option. Check out the answer discussed here:
how to the remove interaction-only columns from sklearn.preprocessing.PolynomialFeatures

Calculating difference in survival functions at time t in Stata

I am estimating a Cox model in Stata using stcox. I estimate the model as
stcox treat x1 x2 x3
I can then use the stcurve command to plot the survival function for treatment and control groups, with the x1, x2 and x3 variables set at their means by doing
stcurve, survival at1(treat=0) at2(treat=1)
However, I would also like to calculate the difference in the survival function at specific, discrete points in time. For instance, I'd like to know the probability of survival to 1 year for treated and control groups, with x's set to their means. I think I might be able to do this with the sts generate command and its adjustfor option, but I am a little confused about whether I should use by or strata when using sts generate and I'm also not sure how to hold the control variables at their means rather than at 0. The Stata help pages suggest I can center the values of the controls by subtracting x1's mean from x1, but I am not sure if I am reading this correctly.
I wrote an answer to a similar question which might be useful here, but do not have enough reputation to answer this as a comment, so here goes:
You can do stcox treat x1 x2 x3 and stcurve, survival at1(treat=0) at2(treat=1) outfile(stcurve.dta). The file stcurve.dta will then contain the data that produced the graph, which you can use to look up specific time points (I'm not terribly familiar with stcox and stcurve, but I think this should work).
Regarding sts generate: you should use by(treat) if you only want to see the difference between treatment groups. However, using by vs. strata is more of a statistical question which I am not qualified to answer. You are reading the help pages correctly, but I cannot say whether this approach makes statistical sense, and I do not know if sts generate allows several variables in adjustfor. My suggestion is to just try it and see if it works. Make your own mean variable and use it to construct a new x for sts generate if the data you need is not in stcurve.dta. Something like this:
local varlist "x1 x2 x3"
foreach x in `varlist' {
egen mean_`x'=mean(`x')
gen adj_`x'=`x' - mean_`x'
}
sts generate survival=s, by(treat) adjustfor(adj_x1)
Then you will have your data in the variable called survival. Another possibility is using sts list, by(treat) adjustfor(adj_x1) compare instead of sts generate.
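For intuition about what is being differenced: under the Cox model, the survival function for a subject with mean-centered covariates x is S(t|x) = S0(t)^exp(xb). The sketch below (Python, with entirely made-up values for the baseline survival and the treatment coefficient) computes the treated-versus-control difference at the one-year mark:

```python
import math

# Hypothetical estimates: baseline survival at t = 1 year and the
# treatment coefficient from a Cox fit. All other covariates are at
# their means, i.e. centered to zero, so they drop out of the index.
S0_1yr = 0.80       # baseline survival at 1 year (made up)
beta_treat = -0.5   # log hazard ratio for treatment (made up)

S_control = S0_1yr ** math.exp(0.0)          # treat = 0
S_treated = S0_1yr ** math.exp(beta_treat)   # treat = 1

diff = S_treated - S_control
print(round(S_treated, 3), round(diff, 3))
```

This is the quantity the stcurve curves trace out over all t; looking up one row of stcurve.dta at t = 1 and subtracting gives the same kind of number.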

Using predict for ancillary parameters in maximum likelihood model in Stata

I wanted to know whether I can use predict for ancillary parameters in a maximum likelihood program as follows (I estimated lnsigma, so sigma is the ancillary parameter in the model):
predict lnsigma, eq(lnsigma)
gen sigma=exp(lnsigma)
I would also like to know whether the above can be used for a heteroscedastic model.
Thank you in advance.
That sounds correct. I would be more explicit by typing predict lnsigma, xb eq(lnsigma). This way your code will not break when someone later decides to write a prediction program for your estimation program and sets the default to something other than the linear prediction.
You can also do it in one line:
predictnl sigma = exp(xb(#2))
This assumes that lnsigma is the second equation in your model; if it is the third equation, replace xb(#2) with xb(#3). predictnl is also an easy way of using the delta method to obtain standard errors and confidence intervals for sigma.
I assume this is your own Stata program. If that is true, then you also have a third option: you can create your own prediction program, which Stata's predict command will recognize. You can find some useful tricks on how to do that here: http://www.stata.com/help.cgi?_pred_se

Equivalent R^2 for Logit Regression in Stata

I am running Logit Regression in Stata.
How can I know the explanatory power of the regression (in OLS, I look at R^2)?
Is there a meaningful approach in expanding the regression with other independent variables (in OLS, I manually keep on adding the independent variables and look for adjusted R^2; my guess is Stata should have simplified this manual process)?
The concept of R^2 is meaningless in logit regression and you should disregard the McFadden Pseudo R2 in the Stata output altogether.
Lemeshow recommends 'to assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation' using the likelihood ratio test (G): G = D(model without the variables [B]) - D(model with the variables [A]).
The Likelihood ratio test (G):
H0: coefficients for eliminated variables are all equal to 0
Ha: at least one coefficient is not equal to 0
When the LR test gives p > .05, do not reject H0, which implies that, statistically speaking, there is no advantage to including the additional IVs in the model.
Example Stata syntax to do this is:
logit DV IV1 IV2
estimates store A
logit DV IV1
estimates store B
lrtest A B // i.e. tests whether B is nested in A
Note, however, that many more aspects have to be checked and tested before we can conclude whether or not a logit model is 'acceptable'. For more details, I recommend visiting:
http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.html
and consult:
Applied Logistic Regression, David W. Hosmer and Stanley Lemeshow, ISBN-13: 978-0471356325
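The same likelihood-ratio comparison can be sketched outside Stata. The snippet below (Python/NumPy with made-up data; the Newton-Raphson fitter is a bare-bones illustration, not production code) fits the full model A and the reduced model B and forms G = 2*(LL_A - LL_B), which is chi-squared with one degree of freedom under H0:

```python
import numpy as np

def logit_loglik(X, y):
    """Fit a logit by Newton-Raphson and return the maximized log likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(50):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                    # weights for the Hessian
        grad = X.T @ (y - p)
        hess = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up data: the outcome truly depends on both IV1 and IV2.
rng = np.random.default_rng(2)
n = 500
iv1 = rng.normal(size=n)
iv2 = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-(0.5 * iv1 + 1.0 * iv2)))).astype(float)

const = np.ones(n)
ll_A = logit_loglik(np.column_stack([const, iv1, iv2]), y)  # model A: IV1 + IV2
ll_B = logit_loglik(np.column_stack([const, iv1]), y)       # model B: IV1 only
G = 2 * (ll_A - ll_B)
print(G > 3.84)   # compare with the 5% chi-squared(1) critical value
```

This is exactly what lrtest A B reports in Stata, together with the p-value.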
I'm worried that you are getting the fundamentals of modelling wrong here:
The explanatory power of a regression model is theoretically determined by your interpretation of the coefficients, not by the R-squared. The R^2 represents the proportion of variance that your linear model explains, which may or may not be an appropriate benchmark for your model.
Similarly, the presence or absence of an independent variable in your model requires substantive justification. If you want to see how the R-squared changes when adding or subtracting parts of your model, see help nestreg for help on nested regression.
To summarize: the explanatory power of your model and its variable composition cannot be determined just by crunching the numbers. You first need an adequate theory to build your model onto.
Now, if you are running logit:
Read Long and Freese (Ch. 3) to understand how the log likelihood converges (or not) in your model.
Do not expect to find something as straightforward as the R-squared for logit.
Use logit diagnostics on your model, just as you would after running OLS.
You might also want to read about the likelihood-ratio chi-squared test, or run additional lrtest commands as explained by Eric.
I certainly agree with the above posters that almost any measure of R^2 for a binary model like logit or probit shouldn't be considered very important. There are, however, ways to see how good a job your model does at predicting. For example, check out the following commands:
lroc
estat class
Also, here's a good article for further reading:
http://www.statisticalhorizons.com/r2logistic