How to know if the optimization problem is infeasible or not? Pyomo Warning: Problem may be infeasible - pyomo

Pyomo can find a solution, but it gives this warning:
WARNING: Loading a SolverResults object with a warning status into
model=(SecondCD);
message from solver=Ipopt 3.11.1\x3a Converged to a locally infeasible point. Problem may be infeasible.
How do I know if the problem is infeasible or not?
This Pyomo model optimizes a farm's allocation of inputs.
model.Crops = Set() # set Crops := cereal rapes maize ;
model.Inputs = Set() # set Inputs := land labor capital fertilizer;
model.b = Param(model.Inputs) # Parameters in CD production function
model.x = Var(model.Crops, model.Inputs, initialize = 100, within=NonNegativeReals)
def production_function(model, i):
    return prod(model.x[i,j]**model.b[j] for j in model.Inputs)
model.Q = Expression(model.Crops, rule=production_function)
...
instance = model.create_instance(data="SecondCD.dat")
opt = SolverFactory("ipopt")
opt.options["tol"] = 1E-64
results = opt.solve(instance, tee=True) # solves and updates instance
instance.display()
If I set b >= 1 (e.g. param b := land 1 labor 1 capital 1 fertilizer 1), Pyomo finds an optimal solution.
But if I set b < 1 (e.g. param b := land 0.1 labor 0.1 capital 0.1 fertilizer 0.1) and set opt.options["tol"] = 1E-64, Pyomo finds a solution but gives that warning.
I expect an optimal solution, but the actual result gives the warning mentioned above.

The message you get (message from solver=Ipopt 3.11.1\x3a Converged to a locally infeasible point. Problem may be infeasible.) doesn't mean that the problem is necessarily infeasible. A nonlinear solver will typically give you a local optimum, and the path it takes to get to a solution, starting point included, is a very important part of finding a "better" local optimum. When you tried another point, you found a feasible solution, and that is the proof that your problem is feasible.
Now, finding the global optimum instead of a local optimum is a little bit harder. One way to find out is to check whether your problem is convex. If it is, every local optimum is also a global optimum. This can be checked mathematically (see https://math.stackexchange.com/a/1707213/470821 and http://www.princeton.edu/~amirali/Public/Teaching/ORF523/S16/ORF523_S16_Lec7_gh.pdf, from a quick Google search). If you find that your problem is not convex, then you can try to prove that there are only a few local optima and that they can be found easily with good starting points. Finally, if this can't be done, you should consider more advanced techniques, all with their pros and cons. For example, you can generate a set of starting solutions to make sure that you cover the whole feasible domain of your problem. Another option is to use metaheuristic methods to help you find a better starting solution.
Also, I am sure that Ipopt has some tools to help tackle this problem of finding a good starting solution that improves the resulting local optimum.
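The idea of covering the domain with several starting points can be sketched in plain Python. This is only an illustration under stated assumptions: the objective below is a made-up one-dimensional non-convex function (not the Cobb-Douglas model from the question), and a fixed-step gradient descent stands in for the local solver. With Pyomo, the analogous loop would presumably re-initialize the variables and call the solver again for each start.

```python
def objective(x):
    # Hypothetical non-convex function with two local minima.
    return x**4 - 3 * x**2 + x


def gradient(x):
    # Derivative of the hypothetical objective above.
    return 4 * x**3 - 6 * x + 1


def local_descent(x0, step=0.01, iterations=2000):
    # Stand-in for a local solver: plain fixed-step gradient descent.
    x = x0
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


def multistart(starts):
    # Run the local search from every start and keep the best end point.
    candidates = [local_descent(x0) for x0 in starts]
    return min(candidates, key=objective), candidates


best_x, all_minima = multistart([-2.0, -1.0, 0.0, 1.0, 2.0])
print("best point:", best_x, "objective:", objective(best_x))
print("distinct local minima found:", sorted({round(x, 3) for x in all_minima}))
```

Different starts end in different local minima, and only comparing the objective values shows which one is better. This gives no proof of global optimality; it only lowers the chance of stopping at a poor local optimum.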


Linear programming feasibility: non-connected solution set

I would like to solve a feasibility problem subject to linear constraints. My constraints look like:
abs(x_i - x_j) < d_ij_1
abs(x_i - x_j - a) < d_ij_2
abs(x_i - x_j) > d_ij_3
etc...
I am adding a picture of an example domain for just 3 variables (I am fixing the first variable to 0). I know that the white regions are valid solutions, and for instance I can choose the red dot.
My issue is that as I increase the number of unknowns x_j, I can no longer represent the problem in a way that makes it easy to find a solution. How can I try to solve such a problem? Would linear programming help, even though the solution space is not connected here? For scale, I am looking at solving it for ~6-10 variables. Also, I posted here because I don't know which Stack site would be the best fit for this kind of problem.

userWarning pymc3 : What does reparameterize mean?

I built a pymc3 model using the DensityDist distribution. I have four parameters, of which three use Metropolis and one uses NUTS (this is chosen automatically by pymc3). However, I get two different UserWarnings:
1. Chain 0 contains number of diverging samples after tuning. If increasing target_accept does not help try to reparameterize.
May I know what reparameterize means here?
2. The acceptance probability in chain 0 does not match the target. It is , but should be close to 0.8. Try to increase the number of tuning steps.
Digging through a few examples, I used 'random_seed', 'discard_tuned_samples', 'step = pm.NUTS(target_accept=0.95)' and so on, and got rid of these user warnings. But I couldn't find details of how these parameter values are decided. I am sure this has been discussed in various contexts, but I am unable to find solid documentation for it. I was using trial and error, as below.
with patten_study:
    #SEED = 61290425 #51290425
    step = pm.NUTS(target_accept=0.95)
    trace = sample(step = step)#4000,tune = 10000,step =step,discard_tuned_samples=False)#,random_seed=SEED)
I need to run this on different datasets, so I am struggling to fix these parameter values for each dataset I am using. Is there any way to supply these values, check the outcome (whether there are any user warnings), and then try other values in a loop?
Pardon me if I am asking something stupid!
In this context, reparameterization basically means finding a different but equivalent model that is easier to compute. There are many things you can do depending on the details of your model:
Instead of using a Uniform distribution, you can use a Normal distribution with a large variance.
Change from a centered hierarchical model to a non-centered one.
Replace a Gaussian with a Student-T.
Model a discrete variable as a continuous one.
Marginalize variables, like in this example.
Whether these changes make sense or not is something that you should decide, based on your knowledge of the model and problem.
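As a rough illustration of the centered versus non-centered item in the list above, the sketch below uses only the standard library rather than pymc3. It assumes fixed, made-up values for mu and sigma and shows that drawing x directly from Normal(mu, sigma) and drawing z from Normal(0, 1) and then setting x = mu + sigma * z describe the same distribution. In a hierarchical model the second form is often easier for NUTS, because the sampled variable z no longer depends on the scale parameter.

```python
import random
import statistics


def centered_draw(rng, mu, sigma):
    # Centered form: sample x directly from Normal(mu, sigma).
    return rng.gauss(mu, sigma)


def non_centered_draw(rng, mu, sigma):
    # Non-centered form: sample a standard normal, then shift and scale it.
    z = rng.gauss(0.0, 1.0)
    return mu + sigma * z


rng = random.Random(0)
mu, sigma = 3.0, 2.0  # hypothetical values, chosen only for the demonstration

centered = [centered_draw(rng, mu, sigma) for _ in range(20000)]
non_centered = [non_centered_draw(rng, mu, sigma) for _ in range(20000)]

print("centered:    ", statistics.mean(centered), statistics.stdev(centered))
print("non-centered:", statistics.mean(non_centered), statistics.stdev(non_centered))
```

Both sets of draws should show a mean near mu and a standard deviation near sigma; the model is the same, only the quantity the sampler explores has changed.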

Auto-rounding up problems python

I am working on a problem that essentially comes down to solving the equation (b/n)*((b-1)/(n-1)) = 0.5, where the only restriction is that n must be at least 10**12. I was able to solve the problem using the methods described here: https://www.alpertron.com.ar/QUAD.HTM
However, I also tried solving the problem as a quadratic equation, checking that the answers are integers and that the required ratio is reached. The program works for lower values of n, but as soon as it approaches the required limit (10**12), it starts giving false solutions. For example, the program yields
b = 707106783028 and
n = 1000000002604
as a solution, and yet it is not one: (b/n)*((b-1)/(n-1)) gives 0.499999999999, but Python just takes it as 0.5. I tried using x.hex() to account for that, but it did not help. Is there any way to make Python store/display the true (or most accurate) value of a float?
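One way to sidestep float rounding altogether, sketched here as a suggestion rather than as part of the original program, is to test the condition with Python's arbitrary-precision integers. Cross-multiplying (b/n)*((b-1)/(n-1)) = 1/2 gives 2*b*(b-1) = n*(n-1), which involves no division. The helper names below are hypothetical; fractions.Fraction is used only to display the exact gap.

```python
from fractions import Fraction


def is_exact_half(b, n):
    # Cross-multiplied form of (b/n) * ((b-1)/(n-1)) == 1/2, integers only.
    return 2 * b * (b - 1) == n * (n - 1)


def exact_ratio(b, n):
    # The same ratio as an exact rational number, with no rounding.
    return Fraction(b, n) * Fraction(b - 1, n - 1)


# A small known solution: (15/21) * (14/20) is exactly 1/2.
print(is_exact_half(15, 21))

# The candidate from the question fails the exact test.
b, n = 707106783028, 1000000002604
print(is_exact_half(b, n))
print(exact_ratio(b, n) - Fraction(1, 2))
```

The last line prints the exact nonzero difference from 1/2 as a fraction, which a 64-bit float cannot be relied on to resolve at this magnitude.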

Random Forest with more features than data points

I am trying to predict whether a particular service ticket raised by client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.ensemble.RandomForestClassifier(n_jobs=3,n_estimators=100,class_weight='balanced',max_features=None,oob_score=True)
Please help!
EDIT:
I have 11k training data points with 900 positives (skewed). I tried sparsifying with LinearSVC, but it didn't work as well as Truncated SVD (Latent Semantic Indexing). max_features=None performs better on the test set than without it.
I have also tried SVM, logistic regression (l2 and l1), and ExtraTrees. RandomForest is still working best.
Right now, going at 92% precision on positives but recall is 3% only
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (length in characters, length in words, their difference, their ratio, day of the week the problem was reported, day of the month, etc.) and now I am at 19-20% recall with >95% accuracy.
Any thoughts on using word2vec average vectors as deep features for the free text instead of tf-idf or bag of words?
[edited]
Random forest handles more features than data points quite well. RF is used, for example, in micro-array studies with a 100:5000 data point/feature ratio, or in single-nucleotide polymorphism (SNP) studies with a 5000:500,000 ratio.
I do disagree with the diagnosis provided by @ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay any attention to non-cross-validated training set prediction performance for an RF model, because every training sample ends up in terminal nodes (leaves) that it helped define. But the overall ensemble model is still robust.
[edit] If you would change the max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross validated training error/precision of a random forest model or many other ensemble models simply does not estimate anything useful.
[I did before edit confuse max_features with n_estimators, sry I mostly use R]
Setting max_features=None is not random forest, but rather 'bagged trees'. You may benefit from a somewhat lower max_features, which improves regularization and speed, or you may not. I would try lowering max_features to somewhere between 27000/3 and sqrt(27000), the typical optimal range.
You may achieve better test set prediction performance by feature selection. You can run one RF model, keep the top ~5-50% most important features and then re-run the model with fewer features. "L1 lasso" variable selection as ncfirth suggests may also be a viable solution.
Your metric of prediction performance, precision, may not be optimal in the case of unbalanced data or if the costs of false negatives and false positives are quite different.
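To make the point about the metric concrete, here is a small sketch that computes precision, recall and F1 from confusion-matrix counts. The counts are hypothetical, chosen only to roughly reproduce the 92% precision and 3% recall mentioned in the question; they are not the asker's actual numbers.

```python
def precision_recall_f1(tp, fp, fn):
    # tp, fp, fn: true positives, false positives, false negatives.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 23 of 25 predicted positives are correct,
# but 744 real positives are missed.
precision, recall, f1 = precision_recall_f1(tp=23, fp=2, fn=744)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision alone looks excellent here, while recall and F1 show that almost all positives are missed, which is why a single metric can mislead on skewed data.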
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with the I.I.D. assumption that any supervised ML model relies on, or you may need to wrap the entire data processing in an outer cross-validation loop to avoid over-optimistic estimates of prediction performance due to, for example, the variable selection step.
Seems like you've overfit on your training set. Basically, the model has learnt the noise in the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that your model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, however this may still be the case (left as an exercise to the reader!). However feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
This returns a new matrix, which had shape (1800, 640) in my run; the exact number of columns varies because the data are random. You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn but sometimes called the sparsity parameter).

Weka Resample to balance instances in binary dataset

I've only been using Weka for a couple of weeks but I am absolutely blown away by how great it is!
But I have a question, I have a dataset with a target column which is either True or False.
6709 instances in my dataset are True
25318 instances are False.
I want to randomly add duplicates of my True instances to produce a new dataset with 25318 True and 25318 False.
The only filter I can find which does this is the supervised Resample filter however I am having trouble understanding what parameters I should use.
(there might be a better filter to do what I want)
I've got some success with these parameters
biasToUniformClass = 1.0
invertSelection = False
noReplacement = False
randomSeed = 1
sampleSizePercent = 157.5 (a magic number I've arrived at by trial and error)
This produces 25277 True and 25165 False. Not exactly what I want, but quite close.
The problem is that I can't figure out how to arrive at the magic number. I'm also not getting exactly the number of instances that I really want.
Is there a better filter for this purpose?
If not, is there a way to calculate the sampleSizePercent magic number?
Any help is greatly appreciated :)
Supplemental question, am I best to run NominalToBinary on my boolean columns to ensure they are Binary? I'm using a NaiveBayes classifier (at the moment) and I don't have any missing instances.
Jason
I think the tricky part of this question is getting a perfect balance using the Resample Filter. This is because, as it is stated in the description, it 'Produces a random sub-sample of a dataset using either sampling with replacement or without replacement'. If these cases are being drawn randomly, there is no guarantee that you will get an equal measure between the two classes.
As for the magic number, it is the ratio of the total number of cases you would like to have after the filter is applied to the number you have now. In your case, you want 50636 cases (2 x 25318) instead of 32027, so the magic number would be 50636 / 32027 = 1.581, i.e. a sampleSizePercent of about 158.1. However, as stated above, you may not get an exact match of true and false cases.
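The arithmetic above can be written as a small helper. The function name is hypothetical, and the calculation assumes that sampleSizePercent is interpreted relative to the size of the original dataset and that biasToUniformClass = 1.0 splits the output evenly between the two classes.

```python
def balanced_sample_size_percent(n_minority, n_majority):
    # Target: both classes as large as the current majority class.
    target_total = 2 * n_majority
    original_total = n_minority + n_majority
    return 100.0 * target_total / original_total


percent = balanced_sample_size_percent(6709, 25318)
print(round(percent, 1))
```

For other datasets, only the two class counts need to change.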
If you really need an exact figure, you could use your favourite spreadsheet and preprocess the data. One possible method is to randomise the true cases (in a separate column), sort them, and copy cases until their number matches that of the false ones. It's not an automated solution, and it is done outside of Weka, but I have used this method before and it does the job reasonably quickly.
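The same preprocessing can also be scripted. The sketch below is an assumption-laden alternative to the spreadsheet route, not a Weka feature: it treats the dataset as a list of rows, takes a hypothetical label_of function that returns the boolean class, and appends random duplicates of the smaller class until both classes have the same count.

```python
import random


def oversample_to_balance(rows, label_of, seed=1):
    # Duplicate randomly chosen minority-class rows until both classes match.
    # Assumes both classes are present in rows.
    rng = random.Random(seed)
    positives = [row for row in rows if label_of(row)]
    negatives = [row for row in rows if not label_of(row)]
    minority, majority = sorted([positives, negatives], key=len)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = list(rows) + extra
    rng.shuffle(balanced)
    return balanced


# Toy data with the class counts from the question: (id, label) pairs.
rows = [(i, True) for i in range(6709)] + [(i, False) for i in range(25318)]
balanced = oversample_to_balance(rows, label_of=lambda row: row[1])

n_true = sum(1 for row in balanced if row[1])
n_false = len(balanced) - n_true
print(n_true, n_false)
```

The balanced rows could then be written back to ARFF or CSV and loaded into Weka as usual.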
Hope this Helps!