model.remove_constraint() performance - python-2.7

I'm working with CPLEX/docplex, solving an LP problem that has a lot of infeasible constraints. Most of the feasibility issues come from the automated formulation of the model, and it's hard to detect the conflicts between constraints a priori.
Using the docplex function ConflictRefiner().refine_conflict(model) I'm able to find at least one set of conflicting constraints.
The problem is that, in order to find all the sets of conflicting constraints, I have to remove some of the conflicting constraints using model.remove_constraint(constraint.name), and that function takes a long time to execute.
Edit: the timings for 135,000 constraints are:
model.remove_constraint(constraint.name): 124 sec
model.remove_constraint(constraint.element): 126 sec
Is there a way to remove a constraint faster than with model.remove_constraint(str_name_constraint)? Is there a way to get all the conflicting sets without having to remove/refine_conflict() for each set? Is there a way to use a hierarchy in the constraints in order to avoid conflicts between them?
(The last question is a little off topic, but it's related to the original problem.)
thanks in advance!

Finally I used a workaround.
I didn't use mdl.remove_constraint(). Instead, I added a priority to every constraint and then used the relaxer module provided by [docplex][1]. I couldn't find any example of the relaxer in the docs (or anywhere else), so I made one of my own (really simple to understand). The relaxer is a really powerful tool, and it's much easier to use than doing all the relaxations by hand, especially when you have to deal with hierarchies in the constraints.
Example:
from docplex.mp.model import Model
import docplex

# we create a simple model
mdl = Model("relax_model")
x1 = mdl.continuous_var(name='X1', lb=0)
x2 = mdl.continuous_var(name='X2', lb=0)

# add conflicting constraints; the name suffixes (low/medium/high) encode
# the priority that the 'match' prioritizer picks up below
c1 = mdl.add_constraint(x1 <= 10, 'c1_low')
c2 = mdl.add_constraint(x1 <= 5, 'c2_medium')
c3 = mdl.add_constraint(x1 >= 400, 'c3_high')
c4 = mdl.add_constraint(x2 >= 1, 'c4_low')

mdl.minimize(x1 + x2)
mdl.solve()
print mdl.report()
print mdl.get_solve_status()  # infeasible model
print
print 'relaxation begin'

from docplex.mp.relaxer import Relaxer
rx = Relaxer(prioritizer='match')  # 'match' takes the priorities from the constraint names
rx.relax(mdl, relax_mode=docplex.mp.relaxer.RelaxationMode.OptInf)
print 'number_of_relaxations= ' + str(rx.number_of_relaxations)
print rx.relaxations()
print mdl.report()
print mdl.get_solve_status()
print mdl.solution
I know this isn't "the solution" to the model.remove_constraint() performance problem, but it works well when you need to avoid that call.

Related

UserWarning pymc3: What does reparameterize mean?

I built a pymc3 model using the DensityDist distribution. I have four parameters, of which three use Metropolis and one uses NUTS (this is automatically chosen by pymc3). However, I get two different UserWarnings:
1. Chain 0 contains number of diverging samples after tuning. If increasing target_accept does not help try to reparameterize.
May I know what reparameterize means here?
2. The acceptance probability in chain 0 does not match the target. It is , but should be close to 0.8. Try to increase the number of tuning steps.
Digging through a few examples, I used 'random_seed', 'discard_tuned_samples', 'step = pm.NUTS(target_accept=0.95)' and so on, and got rid of these user warnings. But I couldn't find details of how these parameter values are decided. I am sure this has been discussed in various contexts, but I am unable to find solid documentation for it. I was using a trial-and-error method, as below.
import pymc3 as pm

with patten_study:  # patten_study is the model defined earlier
    # SEED = 61290425  # 51290425
    step = pm.NUTS(target_accept=0.95)
    trace = pm.sample(step=step)
    # trace = pm.sample(4000, tune=10000, step=step,
    #                   discard_tuned_samples=False, random_seed=SEED)
I need to run this on different datasets, so I am struggling to fix these parameter values for each dataset I am using. Is there any way to either set these values up front, or check the outcome (whether there are any user warnings, then try other values) and run it in a loop?
Pardon me if I am asking something stupid!
In this context, re-parametrization basically means finding a different but equivalent model that is easier to compute. There are many things you can do, depending on the details of your model:
- Instead of using a Uniform distribution, use a Normal distribution with a large variance.
- Change from a centered hierarchical model to a non-centered one.
- Replace a Gaussian with a Student-T.
- Model a discrete variable as a continuous one.
- Marginalize variables, like in this example.
Whether these changes make sense or not is something you should decide, based on your knowledge of the model and the problem.
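For example, here is a minimal sketch (toy data and numbers are made up, not taken from your model) of moving a simple hierarchical normal model from a centered to a non-centered parameterization in pymc3, which is often enough to make the divergence warnings go away:

import numpy as np
import pymc3 as pm

y = np.random.normal(0.0, 1.0, size=20)  # toy observations, just for illustration

# Centered version: theta is drawn directly from N(mu, sigma); NUTS often
# diverges here when sigma gets small (the classic "funnel" geometry).
with pm.Model() as centered:
    mu = pm.Normal('mu', mu=0.0, sd=5.0)
    sigma = pm.HalfNormal('sigma', sd=5.0)
    theta = pm.Normal('theta', mu=mu, sd=sigma, shape=20)
    pm.Normal('obs', mu=theta, sd=1.0, observed=y)

# Non-centered version: sample a standard-normal offset and rescale it,
# which is an equivalent model with a much easier geometry for NUTS.
with pm.Model() as non_centered:
    mu = pm.Normal('mu', mu=0.0, sd=5.0)
    sigma = pm.HalfNormal('sigma', sd=5.0)
    offset = pm.Normal('theta_offset', mu=0.0, sd=1.0, shape=20)
    theta = pm.Deterministic('theta', mu + sigma * offset)
    pm.Normal('obs', mu=theta, sd=1.0, observed=y)
    step = pm.NUTS(target_accept=0.9)
    trace = pm.sample(1000, tune=1000, step=step)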

word2vec guesing word embeddings

Can word2vec be used for guessing words with just context?
Having trained the model with a large data set (e.g. Google News), how can I use word2vec to predict a similar word with only the context, e.g. with input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri."? The output should be Kasparov or maybe Carlsen.
I've seen only the similarity APIs, but I can't make sense of how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words.
But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly-varying quality.
Most word2vec libraries don't offer a direct interface for showing ranked predictions, given context words. The Python gensim library, for the last few versions (as of current version 2.2.0 in July 2017), has offered a predict_output_word() method that roughly shows what the model would predict, given context-words, for some training modes. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word
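As a rough sketch of using it (the toy corpus and parameter values below are made up; note that predict_output_word() only works for models trained with negative sampling, and the keyword names size/iter are the gensim-2.x names):

from gensim.models import Word2Vec

# tiny toy corpus just to keep the sketch self-contained; a real model
# would be trained on something Google-News-sized
sentences = [
    ['kasparov', 'dominated', 'chess', 'for', 'years'],
    ['carlsen', 'will', 'compete', 'against', 'top', 'players'],
    ['kasparov', 'will', 'compete', 'in', 'st', 'louis'],
] * 100

# negative > 0 (negative sampling) is required for predict_output_word()
model = Word2Vec(sentences, size=50, window=3, min_count=1, negative=5, iter=20)

# rank candidate center words, given the surrounding context words
context = ['dominated', 'chess', 'will', 'compete', 'top', 'players']
print(model.predict_output_word(context, topn=5))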
However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):
_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri
A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories well enough to determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee that words of a class will all be nearer to each other than to other words.
There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be more helpful for such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision", but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.

Distinguishing between terms of different domains

What I am trying to do:
I am trying to take a list of terms and distinguish which domain they are coming from. For example "intestine" would be from the anatomical domain while the term "cancer" would be from the disease domain. I am getting these terms from different ontologies such as DOID and FMA (they can be found at bioportal.bioontology.org)
The problem:
I am having a hard time working out the best way to implement this. Currently I am naively taking the terms from the DOID and FMA ontologies and removing from the DOID list any term that is in the FMA list, which we know is anatomical (the DOID list contains terms that may be partly anatomical, such as colon carcinoma: colon being anatomical and carcinoma being disease).
Thoughts:
I was thinking that I could extract root words, prefixes, and suffixes for the different term domains and try to match them against the terms in the list. Another idea is to take more information from their ontologies, such as metadata, and use that to distinguish between the terms.
Any ideas are welcome.
As a first run, you'll probably have the best luck with bigrams. As an initial hypothesis, diseases are usually noun phrases, and usually have a very English-specific structure where NP -> N N, like "liver cancer", which means roughly the same thing as "cancer of the liver." Doctors tend not to use the latter, while the former should be caught with bigrams quite well.
Use the two ontologies you have there as starting points to train some kind of bigram model. Like Rcynic suggested, you can count them up and derive probabilities. A Naive Bayes classifier would work nicely here. The features are the bigrams; classes are anatomy or disease. sklearn has Naive Bayes built in. The "naive" part means, in this case, that all your bigrams are independent of each other. This assumption is fundamentally false, but it works well in a lot of circumstances, so we pretend it's true.
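As a minimal sketch of that setup (the term lists below are made-up stand-ins for what you would export from FMA and DOID):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up stand-ins for terms exported from FMA (anatomy) and DOID (disease)
anatomy = ['liver', 'large intestine', 'colon', 'abdominal cavity']
disease = ['liver cancer', 'colon carcinoma', 'lung cancer', 'hepatitis']

terms = anatomy + disease
labels = ['anatomy'] * len(anatomy) + ['disease'] * len(disease)

# word unigrams + bigrams as features (so single-word terms still count),
# with a multinomial Naive Bayes classifier on top
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(terms, labels)

print(clf.predict(['intestine', 'colon cancer']))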
This won't work perfectly. As it's your first pass, you should be prepared to probe the output to understand how it derived the answer it came upon and find cases that failed on. When you find trends of errors, tweak your model, and try again.
I wouldn't recommend WordNet here. It wasn't written by doctors, and since what you're doing relies on precise medical terminology, it's probably going to add bizarre meanings. Consider, from nltk.corpus.wordnet:
>>> from nltk.corpus import wordnet as reader
>>> from pprint import pprint
>>> livers = reader.synsets("liver")
>>> pprint([l.definition() for l in livers])
[u'large and complicated reddish-brown glandular organ located in the upper right portion of the abdominal cavity; secretes bile and functions in metabolism of protein and carbohydrate and fat; synthesizes substances involved in the clotting of the blood; synthesizes vitamin A; detoxifies poisonous substances and breaks down worn-out erythrocytes',
u'liver of an animal used as meat',
u'a person who has a special life style',
u'someone who lives in a place',
u'having a reddish-brown color']
Only one of these is really of interest to you. As a null hypothesis, there's an 80% chance WordNet will add noise, not knowledge.
The naive approach - what precision and recall is it getting you? If you set up a test case now, then you can track your progress as you apply more sophisticated methods.
I don't know what initial set you are dealing with, but one thing to try is to get your hands on annotated documents (maybe use Mechanical Turk). The documents need to be tagged with the domains you're looking for - anatomical or disease.
Then counting and dividing will tell you how likely a word you encounter is to belong to a domain. With that, the next step would be to tweak some weights.
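For example, a tiny count-and-divide sketch (the annotated snippets below are made up; in practice they would come from your tagged documents):

from collections import Counter, defaultdict

# made-up snippets standing in for documents tagged as 'anatomy' or 'disease'
annotated = [
    ('the colon connects to the small intestine', 'anatomy'),
    ('colon carcinoma is a common cancer', 'disease'),
    ('the liver sits in the abdominal cavity', 'anatomy'),
    ('liver cancer often follows hepatitis', 'disease'),
]

counts = defaultdict(Counter)
for text, domain in annotated:
    for word in text.split():
        counts[word][domain] += 1

def domain_probabilities(word):
    # count-and-divide estimate of P(domain | word)
    c = counts[word]
    total = sum(c.values())
    return {d: n / float(total) for d, n in c.items()}

print(domain_probabilities('cancer'))  # {'disease': 1.0}
print(domain_probabilities('colon'))   # split between 'anatomy' and 'disease'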
Another approach (going in a whole other direction) is to use WordNet. I don't know if it will be useful for exactly your purposes, but it's a massive ontology, so it might help.
Python has bindings to use WordNet via nltk:
from nltk.corpus import wordnet as wn
wn.synsets('cancer')
which gives: [Synset('cancer.n.01'), Synset('cancer.n.02'), Synset('cancer.n.03'), Synset('cancer.n.04'), Synset('cancer.n.05')]
http://wordnetweb.princeton.edu/perl/webwn
Let us know how it works out.

How can I select Yes/No questionID dynamically in a Weka J48 app

I'm developing a Weka app, like Akinator, using the J48 method.
Sample:
http://jbossews-vdoctor.rhcloud.com/doctor
The following is the app's table definition and sample data.
qa means question id (referring to the question master, which can be set by the user) + answer (1: Yes, 2: I don't know, 3: No).
One line per question & answer.
id,qa,class
A,13,1
A,23,1
B,13,2
B,21,2
The point is to find a way to select the question which can maximize the entropy.
Currently this app regards the first node id of the decision tree as the best question.
It then narrows down the options by elimination.
But the accuracy was too poor for it to work correctly, so I'd like to improve it.
I noticed that the qa column was identified as numeric, so it could not build the correct decision tree.
I am confused about what I should improve: the dataset, the table definition, or the logic?
This is quite a broad question that you are asking, and without code or a clear understanding of the problem it is quite difficult to answer, but I'll give some tips for improvement:
Table Definition
What may have made more sense here is to have an attribute for each question, instead of using a single instance per question. For example, instead of id, qa and class, you could have A, B, C, D, E, F and Disease. (I believe there were six questions, and naming each attribute is recommended instead of A-F.)
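Using the two sample rows from your question (ids A and B, questions 1 and 2), the reshaped data might look something like this, with one hypothetical column per question holding the answer code:
Q1,Q2,class
3,3,1
3,1,2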
Dataset
You will need at least as many cases as there are diseases, if not more for defining multiple subsets of the problem space for the same disease. There are likely cases where some questions are irrelevant or missing, and the model may need to handle such situations.
Logic
In such a case, you could run the questionnaire by starting at the root node of the tree and asking questions, moving from node to node until a class (the estimated disease) is reached.
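A minimal, Weka-independent sketch of that node-to-node idea (the tree, question names and answers here are all hypothetical):

# hypothetical decision tree: internal nodes hold a question, leaves hold a class
tree = {
    'question': 'Q1',
    'yes': {'question': 'Q2',
            'yes': {'class': 'Disease A'},
            'no': {'class': 'Disease B'}},
    'no': {'class': 'No disease'},
}

answers = {'Q1': 'yes', 'Q2': 'no'}  # hypothetical user answers

def classify(node, answers):
    # follow the branch matching each answer until a leaf (a class) is reached
    while 'class' not in node:
        node = node[answers[node['question']]]
    return node['class']

print(classify(tree, answers))  # Disease B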
I hope this helps in improving your existing model.
NOTE: I tried your questionnaire and answered No to all of your questions, and strangely ended up with Trichomoniasis. Perhaps there could be a 'No Disease' category in your training data as well.
My nominal qa data builds a decision tree like this, using a binary split.
Actually this structure doesn't make sense, because there is a tree on only one side: when qa equals 23, the answer would always be '3'. It's irrational.
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
You should first reformat your features to get all possible questions A, B, C, D, ... as binary features and your final answer (i.e. what to guess) as the target class, if you want your tree to produce a sequence of questions leading to your answer. Your data will certainly be sparse (many questions without data/answers).
By the way, a binary tree is not the right ML structure and algorithm for building an Akinator-like or 20Q/Guess Who? game. Please look at some suggestions here: https://stats.stackexchange.com/questions/6074/akinator-com-and-naive-bayes-classifier

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka, using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka, using a training set consisting of 60-80% of the records that did not have missing values. This gave me a lower accuracy rate than I wanted (80-86%, depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how it would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering whether I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values I think, so it is not what you want in this case.
A better way to get an idea of the true accuracy of your gender predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To try to get better performance, pick a classifier that performs well and then play with its parameters until the performance improves. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but it is the only way to improve the performance.
You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.