I am able to build a model with additive and multiplicative (interaction) terms as follows:
import statsmodels.formula.api as sm
result_ad = sm.ols('Y_variable ~ x1 + x2*x3', data=df).fit()  # df: DataFrame holding these columns (name assumed)
result_ad.summary()
Here x1, x2 and x3 are a subset of the overall set of features; my dataset has x1, x2, x3, x4, x5, ... I would like to build a single model using all of these features (both the additive/multiplicative ones and the rest).
But ols does not seem to let me mix these variables. Maybe I am missing the syntax, or how to call ols with both sets of variables/features.
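In the formula syntax used by statsmodels, `+` adds a term and `a*b` expands to `a + b + a:b`, so plain and interaction terms can sit in the same formula. A minimal sketch of assembling such a formula from column lists follows; `build_formula` is a hypothetical helper, not part of statsmodels, and the column names are taken from the question.

```python
def build_formula(target, additive, interactions):
    """Build a formula string from plain terms and interaction groups.

    additive: column names that enter as main effects only.
    interactions: tuples of column names joined with '*', which means
    the main effects plus their interaction.
    """
    terms = list(additive) + ["*".join(group) for group in interactions]
    return "{} ~ {}".format(target, " + ".join(terms))


formula = build_formula("Y_variable", ["x1", "x4", "x5"], [("x2", "x3")])
print(formula)  # Y_variable ~ x1 + x4 + x5 + x2*x3

# With statsmodels installed and a DataFrame `df` holding these columns (assumed):
# import statsmodels.formula.api as smf
# result = smf.ols(formula, data=df).fit()
```

The resulting string is passed to ols like any hand-written formula.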
I have a panel dataset for multiple firms over 8 years and I'm trying to run a pooled OLS with industry-specific effects, using the `reghdfe` command to control for a categorical variable (NAICS Industry Code). I typed
reghdfe DV IV control variables i.year, absorb(NAICS Industry Code)
Is this the correct way to use the command? Is it correct to use i.year within the variables or should I add it to the absorbed variables?
In addition, I'm running a fixed-effects panel regression with clustered standard errors. Do I have to cluster the standard errors in reghdfe as well, or is it sufficient to do it only in the fixed-effects panel regression?
You should include your variable year in the absorb() option, which is the intended use of reghdfe:
reghdfe y x, absorb(naics year)
Alternatively, you can also use reg y x i.naics i.year.
I assume NAICS codes to be numeric; otherwise, you might need to transform the variable to numeric, e.g. using egen num_naics= group(naics).
Note: The R-squared rests on different assumptions and might differ between the two commands.
Note 2: If your question is specifically about coding, everyone is better off when you provide example data. Statistical questions might be better suited for Cross Validated.
I'm working on a calculator that uses mathematical formulas produced with multiple linear regression. The problem is that these formulas have to be constructed dynamically: through the admin interface, the user can create and update formulas. The issue is how to model f = a0 + a1x + a2x^2 + ... and save all the data (the coefficients) in the database.
In the future we will also have more complex formulas, for example ones that include exponential functions, sin, cos, and so on. Is there a library, or a better way to model the application, to support these features?
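For the polynomial case, one possible approach is to store only the coefficient list and evaluate it on demand. The sketch below assumes the coefficients are serialised as JSON (for example in a text column); the record layout and the name `example_formula` are hypothetical.

```python
import json


def evaluate_polynomial(coefficients, x):
    """Evaluate a0 + a1*x + a2*x**2 + ... using Horner's method."""
    result = 0.0
    for coefficient in reversed(coefficients):
        result = result * x + coefficient
    return result


# What might be written to the database: coefficients ordered a0, a1, a2, ...
stored = json.dumps({"name": "example_formula", "coefficients": [1.0, 2.0, 3.0]})

# Later: load the record and evaluate f(2) = 1 + 2*2 + 3*2**2
record = json.loads(stored)
value = evaluate_polynomial(record["coefficients"], 2.0)
print(value)  # 17.0
```

Keeping only coefficients avoids evaluating user-supplied code. For formulas with sin, cos or exp, a coefficient list is no longer enough; a symbolic library such as SymPy, which can parse expression strings, may be worth evaluating.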
I am new to prediction models. I am currently using Python 2.7 and sklearn. I would like to know a simple model that combines many features to predict one target.
To make it clearer: let's say I have 4 arrays of size 10: A, B, C, Y. I would like to use the values of A, B, C to predict the values of Y.
Thank you
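A linear regression is probably the simplest model of this kind. The sketch below fits one with NumPy's least-squares solver on made-up data (the arrays and the true weights 2, -1, 0.5 and intercept 3 are assumptions for illustration); sklearn's LinearRegression fits the same kind of model.

```python
import numpy as np

# Synthetic stand-ins for the four arrays of size 10 (values are made up).
rng = np.random.RandomState(0)
A, B, C = rng.normal(size=(3, 10))
Y = 2.0 * A - 1.0 * B + 0.5 * C + 3.0

# Design matrix: one column per feature plus a column of ones for the intercept.
X = np.column_stack([A, B, C, np.ones(len(Y))])

# Ordinary least squares: weights minimising the squared error of X.dot(w) - Y.
weights = np.linalg.lstsq(X, Y, rcond=None)[0]
predictions = X.dot(weights)

print(weights)  # one weight each for A, B, C, followed by the intercept
```

With real, noisy data the fit will not be exact, and 10 samples is very little to estimate 4 parameters from.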
I'm using the RFECV class in sklearn to find the number of features that yields the highest cross-validation score on 2 folds. I am using a ridge regressor as my estimator.
rfecv = RFECV(estimator=ridge,step=1, cv=KFold(n_splits=2))
rfecv.fit(df, y)
I have 5 features in my dataset that I have standardized using StandardScaler.
I'll run the RFECV on my data, and it'll say that 2 features is optimal. But when I remove one of the features with the lowest regression coefficient and rerun the RFECV, it now says that 3 features is optimal.
When I progress through all features one at a time (as the recursive should do) I find that 3 is in fact the optimal.
I've tested this with other datasets, and have found that the optimal number of features changes as I remove features one at a time and rerun RFECV.
I might be missing something, but isn't that what RFECV is supposed to solve?
Any additional insights on RFECV are appreciated.
This makes sense, actually. RFECV recommends a number of features based on the data it is given. When you remove a feature, you change that data and, with it, the scores being compared.
From the docs:
# Determine the number of subsets of features by fitting across
# the train folds and choosing the "features_to_select" parameter
# that gives the least averaged error across all folds.
...
n_features_to_select = max(
    n_features - (np.argmax(scores) * step),
    n_features_to_select)
n_features_to_select is used to determine how many features should be used in RFE for any particular iteration (within/under-the-hood of RFECV).
rfe = RFE(estimator=self.estimator,
          n_features_to_select=n_features_to_select,
          step=self.step, verbose=self.verbose)
And so this is directly connected to the number of features you include in your initial rfecv.fit() step.
Also, removing the feature with the lowest regression coefficient is not the best way to trim features. The coefficient reflects the feature's impact on the dependent variable, not necessarily its contribution to the model's accuracy.
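The rule in the quoted snippet can be reproduced on its own to see why the answer depends on how many features you start with. This is a sketch with made-up scores, not sklearn's code; it assumes scores[k] is the averaged score after k elimination steps, and `features_to_select` is a hypothetical helper.

```python
def features_to_select(scores, n_features, step=1, minimum=1):
    """Apply the quoted rule to a list of averaged cross-validation scores."""
    best_index = scores.index(max(scores))  # first maximum, like np.argmax
    return max(n_features - best_index * step, minimum)


# Hypothetical scores for a run that starts with all 5 features ...
with_five = features_to_select([0.70, 0.72, 0.75, 0.80, 0.78], n_features=5)
# ... and for a separate run that starts after one feature was removed by hand.
with_four = features_to_select([0.74, 0.79, 0.77, 0.70], n_features=4)

print(with_five, with_four)  # 2 3
```

The two runs score different candidate subsets, so nothing forces them to agree on the best count.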
I'm looking to perform feature selection with a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which selects relevant features for each label separately.
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Per-label pipeline: keep the 1000 best features by chi-squared, then fit an SVM
clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)
I then plan to extract the indices of the included features, per label, using this:
selected_features = []
for i in multi_clf.estimators_:
    selected_features += list(i.named_steps["chi2"].get_support(indices=True))
Now, my question is, how do I choose which selected features to include in my final model? I could use every unique feature (which would include features that were only relevant for one label), or I could do something to select features that were relevant for more labels.
My initial idea is to create a histogram of the number of labels a given feature was selected for, and to identify a threshold based on visual inspection. My concern is that this method is subjective. Is there a more principled way of performing feature selection for multilabel datasets using sklearn?
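For reference, the counting step behind such a histogram could look like the sketch below. The per-label index lists and the cutoff `min_labels` are hypothetical; in practice the lists would come from get_support(indices=True) on each fitted estimator.

```python
from collections import Counter

# Hypothetical per-label selections: label -> indices of the features chosen for it.
selected_per_label = {
    "label_a": [0, 3, 7],
    "label_b": [3, 7, 9],
    "label_c": [3, 4, 7],
}

# Number of labels for which each feature index was selected.
label_counts = Counter(
    index for indices in selected_per_label.values() for index in indices
)

# Keep features selected for at least `min_labels` labels (cutoff is an assumption).
min_labels = 2
final_features = sorted(
    index for index, count in label_counts.items() if count >= min_labels
)
print(final_features)  # [3, 7]
```

The cutoff remains a judgment call; the counts only make that choice explicit.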
According to the conclusions in this paper:
[...] rank features according to the average or the maximum
Chi-squared score across all labels, led to most of the best
classifiers while using less features.
Then, in order to select a good subset of features you just need to do (something like) this:
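import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

# Chi-squared score of every feature, computed separately for each label
label_scores = []
for label in labels:
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[label])
    label_scores.append(list(selector.scores_))

# MeanCS: keep features whose average score across labels exceeds the threshold
selected_mean = np.mean(label_scores, axis=0) > threshold
# MaxCS: keep features whose maximum score across labels exceeds the threshold
selected_max = np.max(label_scores, axis=0) > threshold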
Note: in the code above I'm assuming that X is the output of some text vectorizer (the vectorized version of the texts) and Y is a pandas dataframe with one column per label (so I can select the column Y[label]). Also, there is a threshold variable that should be fixed beforehand.
http://scikit-learn.org/stable/modules/feature_selection.html
There is a multitude of options, but SelectKBest and Recursive feature elimination are two reasonably popular ones.
RFE works by repeatedly dropping the least informative features from the model and retraining, so that the features left at the end are the ones that contribute most to prediction accuracy.
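The elimination loop can be sketched in a few lines. This is a NumPy-only illustration of the idea, not sklearn's RFE implementation: it ranks features by the absolute value of their least-squares coefficient, which assumes the features are on a comparable scale. The data and the helper `recursive_elimination` are made up for the example.

```python
import numpy as np


def recursive_elimination(X, y, n_keep):
    """Drop the feature with the smallest absolute coefficient until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit ordinary least squares (with intercept) on the surviving columns.
        design = np.column_stack([X[:, remaining], np.ones(len(y))])
        coef = np.linalg.lstsq(design, y, rcond=None)[0]
        # Ignore the intercept (last entry) when looking for the weakest feature.
        weakest = int(np.argmin(np.abs(coef[:-1])))
        remaining.pop(weakest)
    return remaining


# Synthetic data: only features 0 and 2 actually drive the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

kept = recursive_elimination(X, y, n_keep=2)
print(kept)  # [0, 2]
```

sklearn's RFE wraps the same loop around any estimator that exposes coefficients or feature importances, and RFECV adds cross-validation to choose how many features to keep.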
What is best is highly dependent on your data and use case.
Aside from what can loosely be described as cross-validation approaches to feature selection, you can look at Bayesian model selection, which is a more theoretical approach and tends to favor simpler models over complex ones.