Influence diagrams / Decision models in Stan and PyMC3

Is it possible to write decision-making models in either Stan or PyMC3? By that I mean: we define not only the distributions of random variables, but also decision and utility variables, and determine the decisions maximizing expected utility.
My understanding is that Stan is more of a general-purpose optimizer than PyMC3, which suggests decision models could be implemented more directly in it, but I would like to hear what people have to say.
Edit: While it is possible to enumerate all decisions and compute their corresponding expected utilities, I am wondering about more efficient methods, since the number of decisions could be combinatorially large (for example, how many items to buy from a list of thousands of products). Influence diagram algorithms exploit factorizations in the model to identify independences that allow computing the decisions from only a smaller set of relevant random variables. I wonder if either Stan or PyMC3 does that kind of thing.

The basic steps for Bayesian decision theory are:
1. Enumerate a finite set of decisions that could be made
2. Specify a utility function of the decision and perhaps other things
3. Draw from the posterior distribution of all the unknowns given the known data
4. Evaluate the utility function for each possible decision and each posterior draw
5. Make the decision with the highest expected utility, averaging over the posterior draws.
You can do those five steps with any software --- Stan and PyMC3 included --- that produces (valid) draws from the posterior distribution. In Stan, the utility function should be evaluated in the generated quantities block.
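For illustration, here is a minimal post-processing sketch in Python. It assumes the posterior draws have already been produced by Stan or PyMC3 and are available as a NumPy array; the newsvendor-style utility and the fake draws are assumptions made for the example, not part of the answer above:

import numpy as np

# Stand-in for real posterior draws of an unknown (e.g. demand),
# which would come from Stan or PyMC3 in practice.
rng = np.random.default_rng(0)
theta_draws = rng.gamma(shape=9.0, scale=1.0, size=4000)

# Step 1: enumerate a finite set of decisions (here, stock levels).
decisions = np.arange(0, 21)

# Step 2: a utility function of the decision and the unknown.
def utility(d, theta, price=5.0, cost=2.0):
    return price * np.minimum(d, theta) - cost * d

# Steps 3-5: evaluate the utility for every decision on every posterior
# draw, average over draws, and pick the decision with the highest
# expected utility.
expected_utility = np.array([utility(d, theta_draws).mean() for d in decisions])
best_decision = decisions[np.argmax(expected_utility)]
print(best_decision, expected_utility.max())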

How to write tests for mathematical optimization procedures?

I'm working on a project where I need to minimize functions of several variables, like func(input_parameters, variable_parameters) -> min over variable_parameters.
I use optimization functions from SciPy, so the minimization process is a grey box: I can see the code on GitHub and read about the algorithms used, but I'd like to assume it works correctly and focus on testing my own project.
That said, the particular library shouldn't matter for this question.
At the moment I use a few approaches:
Create simple examples, find the global/local minima by hand, and write tests that run the optimization and compare its solution with the known one
If a method needs gradients, compare analytically calculated gradients with their numerical approximations in tests (a sketch follows this list)
For iterative algorithms built on top of those provided by SciPy, check in tests that the sequence of function values is monotonically nonincreasing
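For the gradient check in particular, SciPy ships a helper. A minimal sketch, with an illustrative quadratic and its analytic gradient:

import numpy as np
from scipy.optimize import check_grad

# Illustrative function and its analytic gradient.
f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 6 * x[1]])

# check_grad returns the 2-norm of the difference between the analytic
# gradient and a finite-difference approximation at the given point.
err = check_grad(f, grad, np.array([1.0, 2.0]))
assert err < 1e-5, f"gradient mismatch: {err}"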
Is there a book or an article about testing mathematical optimization procedures?
P.S. I'm not talking about Test functions for optimization; I'm asking about approaches used to test optimization procedures to find bugs faster.
I find the hypothesis library really useful for testing optimisation algorithms in development.
You can set it up to generate random test cases (functions, linear programs, etc.) according to some specification. The idea is that you pass these to your algorithm and test for known invariants. For example, you could have it throw random problems or subproblems at your algorithm and check that:
Gradient descent methods produce a series of nonincreasing objectives
Local search finds a solution with no better neighbours
Heuristics maintain feasibility
There's a useful PyCon talk here explaining the idea of property-based testing. It focuses more on testing APIs than algorithms, but I think the ideas transfer. I've found this approach does a pretty good job of finding cases of unexpected behaviour as I'm writing a new algorithm.
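To make this concrete, here is a minimal sketch of such a property-based test; the convex quadratic family, the tolerance, and the use of scipy.optimize.minimize are illustrative assumptions rather than part of the answer above:

import numpy as np
from hypothesis import given, strategies as st
from scipy.optimize import minimize

# Generate random instances of an easy problem family (1-D convex
# quadratics with minimum at c) and random starting points, then check
# known invariants of the solver on every generated case.
@given(st.floats(min_value=-100, max_value=100),
       st.floats(min_value=-100, max_value=100))
def test_minimizer_on_random_quadratics(c, x0):
    f = lambda x: (x[0] - c) ** 2
    result = minimize(f, np.array([x0]))
    assert result.success                  # the solver should not fail here
    assert abs(result.x[0] - c) < 1e-3     # and should find the known minimum

test_minimizer_on_random_quadratics()  # hypothesis runs many random cases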

Choosing the best subset of features

I want to choose the best available feature subset for distinguishing two classes, to be fed into a statistical framework that I have built, where the features are not independent.
After looking at the feature selection methods in machine learning, it seems that they fall into three different categories: filter, wrapper, and embedded methods, and the filter methods can be either univariate or multivariate. It would make sense to use either filter (multivariate) or wrapper methods, because both, as I understand, look for the best subset; however, as I am not using a classifier, how can I use them?
Does it make sense to apply such methods (e.g. Recursive Feature Elimination) to a decision tree or Random Forest classifier, where the features form the splitting rules, and then take the resulting best subset and feed it into my framework?
Also, as most of the algorithms provided by Scikit-learn are univariate algorithms, is there any other Python-based library that provides more subset feature selection algorithms?
I think the statement that "most of the algorithms provided by Scikit-learn are univariate algorithms" is false. Scikit-learn handles multi-dimensional data very nicely. The RandomForestClassifier that they provide will give you an estimate of feature importance.
Another way to estimate feature importance is to choose any classifier that you like, train it and estimate performance on a validation set. Record the accuracy and this will be your baseline. Then take that same train/validation split and randomly permute all values along one feature dimension. Then train and validate again. Record the difference of this accuracy from your baseline. Repeat this for all feature dimensions. The results will be a list of numbers for each feature dimension that indicates its importance.
You can extend this to use pairs or triples of feature dimensions, but the computational cost will grow quickly. If your features are highly correlated, you may benefit from doing this for at least the pairwise case.
Here is the source document of where I learned that trick: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
(It should work for classifiers other than Random Forests.)
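A sketch of that trick on synthetic data follows. Note that, as a common shortcut, this version permutes the feature in the validation copy and re-scores the already-trained model instead of retraining, so it is a cheaper variant of the procedure described above; the data and classifier choice are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; any classifier could replace the forest.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_val, clf.predict(X_val))  # record the baseline

rng = np.random.default_rng(0)
importances = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # destroy one feature
    drop = baseline - accuracy_score(y_val, clf.predict(X_perm))
    importances.append(drop)  # bigger accuracy drop => more important
print(importances)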

Is there a way to quantify impact of independent variables with gradient boosting?

I've been asked to run a model using gradient boosting or random forest. So far so good; however, the only output that comes back in terms of variable importance is based on the number of times a variable was used as a branch rule. I've now been asked to basically get coefficients or somehow quantify the impact that the variables have on the target.
Is there a way to do this with a gradient boosting model? My other thought was to take only the variables that were shown to be used as branch rules and use them in a regular decision tree, a GLM, or a regular regression model.
Any help or ideas would be appreciated!! Thanks so much!
Just to make certain there is no misunderstanding: the SAS implementation of decision trees/gradient boosting (at least in Enterprise Miner) uses split-based variable importance.
Split-based importance does NOT count the number of splits made.
It is the ratio of the reduction of sum-of-squares achieved by one variable (specifically, the sum over all splits on this variable) to the reduction of sum-of-squares achieved by all splits in the model.
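In symbols, on my reading of that description (a paraphrase, not necessarily SAS's exact formula):

\mathrm{Importance}(x_j) = \frac{\sum_{s \,\in\, \text{splits on } x_j} \Delta SS_s}{\sum_{s \,\in\, \text{all splits}} \Delta SS_s}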
If you are using surrogate rules, highly correlated variables will receive roughly the same value.

Is it possible to specify number of nodes in hidden layer of SVM in OpenCV 3.1.0?

I was just curious to have better control over the outcome of the SVM.
I tried searching the documentation but couldn't find a function that seems to do this.
One could say that SVM does not have hidden nodes, but this is only partially true.
SVMs were originally called Support Vector Networks (this is what Vapnik himself called them), and they were seen as a kind of neural network with a single hidden layer. Due to the popularity of neural networks at the time, many people to this day use the sigmoid "kernel", even though it is rarely a valid Mercer kernel (only because the NN community was so used to it that they kept using it, even though it has no mathematical justification).
So is an SVM a neural net or not? Yes, it can be seen as a neural network. In fact, many classifiers can be seen through such a prism. However, what makes SVMs really different is the way they are trained and parametrized. In particular, SVMs work with "activation functions" that are valid Mercer kernels (they denote a dot product in some space). Furthermore, the weights of the hidden nodes are equal to training samples, so you get as many hidden units as you have training examples. During training, the SVM, on its own, reduces the number of hidden units by solving an optimization problem which "prefers" sparse solutions (removal of hidden units), thus ending up with a hidden layer consisting of a subset of the training samples; we call them support vectors. To underline: this is not the classical view of SVMs, but it is a valid perspective, and one that might be easier to understand for someone from the NN community.
So can you control this number? Yes and no. No, because the SVM needs all these hidden units to have a valid optimization problem, and it will remove all redundant ones on its own. Yes, because there is an alternative optimization problem, called nu-SVM, which uses a nu hyperparameter that is a lower bound on the fraction of support vectors, and thus a lower bound on the number of hidden units. You cannot, unfortunately, directly specify the upper bound.
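As a hedged illustration of this formulation via scikit-learn (the synthetic data and nu=0.2 are arbitrary choices for the example):

from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, random_state=0)

# nu lower-bounds the fraction of support vectors, i.e. the number of
# "hidden units" is at least nu * n_samples.
clf = NuSVC(nu=0.2).fit(X, y)
print(len(clf.support_))  # >= 0.2 * 200 = 40 support vectors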
But I really need to! If this is the case, you can go with approximate solutions that respect your restriction. You can use an H-dimensional sampler which approximates the kernel space explicitly (http://scikit-learn.org/stable/modules/kernel_approximation.html). One such method is the Nystroem method. In short, if you want "H hidden units", you simply fit a Nystroem model to produce H-dimensional output, transform your input data through it, and fit a linear SVM on top. From a mathematical perspective, this approximates the true non-linear SVM with a given kernel, although convergence is quite slow.
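A minimal sketch of that recipe with scikit-learn; the RBF kernel, gamma, H=50, and the synthetic data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

H = 50  # the desired number of "hidden units"
model = make_pipeline(
    # Approximate the RBF kernel space with an H-dimensional feature map.
    Nystroem(kernel="rbf", gamma=0.1, n_components=H, random_state=0),
    # A linear SVM on the approximate feature map stands in for the
    # non-linear kernel SVM.
    LinearSVC(),
)
model.fit(X, y)
print(model.score(X, y))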

Inference in a Bayesian Network

I need to perform some inferences on a Bayesian network, such as the example I have created below.
I was looking at doing something like this to solve an inference such as P(F | A = True, B = True). My initial approach was to do something like:
For every possible output of F
    For every state of each observed variable (A, B)
        For every unobserved variable (C, D, E, G)
            // Calculate probability
But I don't think this will work, because we actually need to go over many variables at once, not one at a time.
I have heard about Pearl's message-passing algorithm but have yet to find a reasonable description that isn't extremely dense. For added information: these Bayesian networks are constrained to have no more than 15-20 nodes, we have all the conditional probability tables, and the code doesn't really have to be fast or efficient.
Basically I am looking for a way to do this, not necessarily the BEST way to do this.
Your Bayesian Network (BN) does not seem to be particularly complex. I think you should easily get away with using an exact inference method, such as the junction tree algorithm. Of course, you can still just do brute-force enumeration, but that would be a waste of CPU resources, given that there are so many nice libraries out there that implement smarter ways of doing both exact and approximate inference in graphical models.
Since your tag mentions C++, my recommendation would be libDAI. It is a well-written library that implements multiple exact and approximate inference algorithms on generic factor graphs. It does not have any weird dependencies and is very easy to integrate into your project. It is particularly well suited for discrete cases such as yours, for which you have the probability tables.
Now, you noticed that I mentioned factor graphs. If you are not familiar with the concept, I will refer you to the Wikipedia article on factor graphs, as well as What are "Factor Graphs" and what are they useful for?. The principle is very simple: you represent your BN as a factor graph, and then libDAI will do the inference for you.
EDIT:
Since CPU resources do not seem to be a problem for you and simplicity is the key, you can always go with brute force enumeration. The idea is straightforward.
Your Bayesian Network represents a joint probability distribution, which you can write down in terms of an equation, e.g.
P(A,B,C) = P(A|B,C) * P(B|C) * P(C)
Assuming that you have tables for all your conditional probability distributions, i.e. P(A|B,C), P(B|C), and P(C), you can simply go over all the possible values of the variables A, B, and C and calculate the output.
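A minimal sketch of that enumeration for the three-variable example; the probability tables are made-up numbers for illustration:

import numpy as np

# Made-up tables for P(A,B,C) = P(A|B,C) * P(B|C) * P(C), all binary.
P_C = np.array([0.7, 0.3])                # indexed [c]
P_B_given_C = np.array([[0.9, 0.1],       # indexed [c][b]
                        [0.4, 0.6]])
P_A_given_BC = np.array([[[0.8, 0.2],     # indexed [b][c][a]
                          [0.5, 0.5]],
                         [[0.3, 0.7],
                          [0.1, 0.9]]])

def joint(a, b, c):
    return P_A_given_BC[b][c][a] * P_B_given_C[c][b] * P_C[c]

# Example query P(A | B=1): sum the joint over the unobserved variable
# C for each value of A, then normalize.
unnormalized = np.array([sum(joint(a, 1, c) for c in (0, 1)) for a in (0, 1)])
posterior = unnormalized / unnormalized.sum()
print(posterior)  # [P(A=0 | B=1), P(A=1 | B=1)]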