Prediction models, Objective functions and Optimization

Prediction models, Objective functions and Optimization - pyomo

How do we define objective functions while doing optimization using pyomo in Python. We have defined Prediction models separately. Next step is to bring objective functions from prediction models (Gradient boosting, Random forest , Linear regression and others) and optimize to achieve maximum and minimum optimization. please suggest and share any working example in pyomo.

Due to Pyomo use algebraic expression you should:
Define the mathematical expression of your prediction model function.
Implement the proper mathematical model in Pyomo including the needing parameters, variables and other constraints.
Apply the min - max
You can make a cycle as follows:
Prediction model function -> Min-max refinement -> Prediction model function adjustment -> Min-max refinement -> ...
As many times you need to reach yor expected accuracy. API connection and multi-thread implementation could work.

Related

Influence diagrams / Decision models in Stan and PyMC3

Is it possible to write decision-making models in either Stan or PyMC3? By that I mean: we define not only the distribution of random variables, but also the definition of decision and utility variables, and determine the decisions maximizing expected utility.
My understanding is that Stan is more of a general optimizer than PyMC3, so that suggests decision models would be more directly implemented in it, but I would like to hear what people have to say.
Edit: While it is possible to enumerate all decisions and compute their corresponding expected utility, I am wondering about more efficient methods since the number of decisions could be combinatorially too many (for example, how many items to buy from a list with thousands of products). Influence diagram algorithms exploit factorizations in the model to identify independences that allow computing of the decisions on only a smaller set of relevant random variables. I wonder if either Stan or PyMC3 do that kind of thing.

The basic steps for Bayesian decision theory are:
Enumerate a finite set of decisions that could be made
Specify a utility function of the decision and perhaps other things
Draw from the posterior distribution of all the unknowns given the known data
Evaluate the utility function for each possible decision and each posterior draw
Make the decision with the highest expected utility, averaging over the posterior draws.
You can do those five steps with any software --- Stan and PyMC3 included --- that produces (valid) draws from the posterior distribution. In Stan, the utility function should be evaluated in the generated quantities block.

Choosing the best subset of features

I want to choose the best feature subset available that distinguish two classes to be fed into a statistical framework that I have built, where features are not independent.
After looking at the feature selection methods on machine learning it seems that it fall into three different categories: Filter,wrapper and Embedded methods. and the filter methods can be either: univariate or multivariate. It does make sense to use either Filter(multivariate) or wrapper methods because both -as I understood- looks for the best subset, however, as I am not using a classifier how can use it ?
Does it make sense to apply such methods (e.g. Recursive feature
elimination ) to DT or Random Forest classifier where the features
have rules there, and then take the resulted best subset and fed it
into my framework ?**
Also, as most of the algorithms provided by Scikit-learn are
univariate algorithms, Is there any other python-based libraries
that provide more subset feature selection algorithms ?

I think the statement that "most of the algorithms provided by Scikit-learn are univariate algorithms" is false. Scikit-learn handles multi-dimensional data very nicely. The RandomForestClassifier that they provide will give you an estimate of feature importance.
Another way to estimate feature importance is to choose any classifier that you like, train it and estimate performance on a validation set. Record the accuracy and this will be your baseline. Then take that same train/validation split and randomly permute all values along one feature dimension. Then train and validate again. Record the difference of this accuracy from your baseline. Repeat this for all feature dimensions. The results will be a list of numbers for each feature dimension that indicates its importance.
You can extend this to use pairs, or triples of feature dimensions, but the computational cost will grow quickly. If you're features are highly correlated you may benefit from doing this for at least the pairwise case.
Here is the source document of where I learned that trick: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
(It should work for classifiers other than Random Forests.)

Is there a way to quantify impact of independent variables with gradient boosting?

I've been asked to run a model using gradient boosting or random forest. So far so good, however, the only output that comes back in terms of variable importance is based on the number of times a variable was used as a branch rule. I've now been asked to basically get coefficients or somehow quantify the impact that the variables have on the target.
Is there a way to do this with a gradient boosting model? My other thoughts were to either use only the variables that were showed to be sued as branch rules in a regular decision tree or in a GLM or regular regression model.
Any help or ides would be appreciated!! Thanks so much!

Just to make certain there is not a misunderstanding: SAS implementation of decision tree/gradient boosting (at least in EM) uses Split-Based variable Importance.
Split-Based Importance does NOT count the number splits made.
It is the ratio of the reduction of sum-of-squares by one variable (specific the sum over all splits by this variable) in relation to the reduction of sum-of-squares achieved by all splits in the model.
If you are using surrogate rules, highly correlated variables will receive roughly the same value.

how to use tf-idf with Naive Bayes?

As per my search regarding the query, that I am posting here, I have got many links which propose solution but haven't mentioned exactly how this is to be done. I have explored, for example, the following links :
Link 1
Link 2
Link 3
Link 4
etc.
Therefore, I am presenting my understanding as to how the Naive Bayes formula with tf-idf can be used here and it is as follows:
Naive-Bayes formula :
P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))
tf-idf weighting can be employed in the above formula as:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
total_unique_words_in_all_classes : as is.
This question has been posted multiple times on stack overflow but nothing substantial has been answered so far. I want to know that the way I am thinking about the problem is correct or not i.e. implementation that I have shown above. I need to know this as I am implementing the Naive Bayes myself without taking help of any Python library which comes with the built-in functions for both Naive Bayes and tf-idf. What I actually want is to improve the accuracy(currently 30%) of the model which was using Naive Bayes trained classifier. So, if there are better ways to achieve good accuracy, suggestions are welcome.
Please suggest me. I am new to this domain.

It would be better if you actually gave us the exact features and class you would like to use, or at least give an example. Since none of those have been concretely given, I'll just assume the following is your problem:
You have a number of documents, each of which has a number of words.
You would like to classify documents into categories.
Your feature vector consists of all possible words in all documents, and has values of number of counts in each document.
Your Solution
The tf idf you gave is the following:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
Your approach sounds reasonable. The sum of all probabilities would sum to 1 independent of the tf-idf function, and the features would reflect tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.
Another potential Solution
It took me a while to wrap my head around this problem. The main reason for this was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would help ignore this issue entirely.
If you wanted to use this method:
Compute mean, variation of tf-idf values for each class.
Compute the prior using a gaussian distribution generated by the above mean and variation.
Proceed as normal (multiply to prior) and predict values.
Hard coding this shouldn't be too hard since numpy inherently has a gaussian function. I just prefer this type of generic solution for these type of problems.
Additional methods to increase
Apart from the above, you could also use the following techniques to increase accuracy:
Preprocessing:
Feature reduction (usually NMF, PCA, or LDA)
Additional features
Algorithm:
Naive bayes is fast, but inherently performs worse than other algorithms. It may be better to perform feature reduction, and then switch to a discriminative model such as SVM or Logistic Regression
Misc.
Bootstrapping, boosting, etc. Be careful not to overfit though...
Hopefully this was helpful. Leave a comment if anything was unclear

P(word|class)=(word_count_in_class+1)/(total_words_in_class+total_unique_words_in_all_classes
(basically vocabulary of words in the entire training set))
How would this sum up to 1? If using the above conditional probabilities, I assume the SUM is
P(word1|class)+P(word2|class)+...+P(wordn|class) =
(total_words_in_class + total_unique_words_in_class)/(total_words_in_class+total_unique_words_in_all_classes)
To correct this, I think the P(word|class) should be like
(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))
Please correct me if I am wrong.

I think there are two ways to do it:
Round down tf-idf as integers, then use the multinomial distribution for the conditional probabilities. See this paper https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
Use Dirichlet distribution which is a continuous version of the multinomial distribution for the conditional probabilities.
I am not sure if Gaussian mixture will be better.

Inference in a Bayesian Network

I need to perform some inferences on a Bayesian network, such as the example I have created below.
I was looking at doing something like something like this to solve an inference such as P(F| A = True, B = True). My initial approach was to do something like
For every possible output of F
For every state of each observed variable (A,B)
For every unobserved variable (C, D, E, G)
// Calculate Probability
But I don't think this will work because we actually need to go over many variables at once, not each at a time.
I have heard about Pearls algorithm for message passing but am yet to find a reasonable description that isn't extremely dense. For added information, these Bayesian networks are constrained as to not have more than 15-20 nodes, and we have all the conditional probability tables, the code doesn't really have to be fast or efficient.
Basically I am looking for a way to do this, not necessarily the BEST way to do this.

Your Bayesian Network (BN) does not seem to be particularly complex. I think you should easily get away with using exact inference method, such as junction tree algorithm. Of course, you can still just do brute force enumeration, but that would be a waste of CPU resources given that there are so many nice libraries out there that implement smarter ways of doing both exact and approximate inference in graphical models.
Since your tag mentions C++, my recommendation would be libDAI. It is a well written library that implements multiple exact and approximate inference on generic factor graphs. It does not have any weird dependencies and is very easy to integrate into your project. It is particularly well suited for discrete cases, such as yours, for which you have the probability tables.
Now, you noticed that I mentioned factor graphs. If you are not familiar with the concept, I will refer you to Wikipedia article on factor graphs as well as What are "Factor Graphs" and what are they useful for?. The principle is very simple, you represent your BN as a factor graph and then libDAI will do the inference for you.
EDIT:
Since CPU resources do not seem to be a problem for you and simplicity is the key, you can always go with brute force enumeration. The idea is straightforward.
Your Bayesian Network represents a joint probability distribution, which you can write down in terms of an equation, e.g.
P(A,B,C) = P(A|B,C) * P(B|C) * P(C)
Assuming that you have tables for all your conditional probability distributions, i.e. P(A|B, C) P(B|C) and P(C) then you can simply go over all the possible values of variables A, B, and C and calculate the output.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js