I am using PyMC3 and am essentially trying to solve the problem given in Crystal Ball Tutorial.pdf, page 3-11: http://faculty.insead.edu/delquie/msp/Other%20downloads/Crystal%20Ball%20Tutorial.pdf
I am trying to create a Normal distribution with mean=8, sd=2 and has a lower limit of 5.
In other words, it is a normal distribution (8, 2), but instead of running from -infinity to infinity, it is cut off below 5.
Are there any examples of how to do this?
What you want to do is to sample from a bounded (normal) distribution. Using PyMC3 you can set arbitrary bounds on distributions like this.
import pymc3 as pm

with pm.Model() as model:
    boundedN = pm.Bound(pm.Normal, lower=5.0)
    a = boundedN('a', mu=8, sd=2)
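To see what the bound does to the distribution, here is a rough standard-library sketch. It is not PyMC3 code: it simply draws from Normal(8, 2) and discards anything below 5, which is my understanding of what the bounded variable looks like when sampled on its own. The function name and the sample size are made up for illustration.

```python
import random

def sample_lower_bounded_normal(mu, sd, lower, n, seed=0):
    """Draw n values from Normal(mu, sd) conditioned on being >= lower."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        x = rng.gauss(mu, sd)
        if x >= lower:          # discard anything below the bound
            draws.append(x)
    return draws

samples = sample_lower_bounded_normal(mu=8, sd=2, lower=5, n=10000)
print(min(samples) >= 5)                # True by construction
print(sum(samples) / len(samples))      # the mean shifts somewhat above 8
```

Cutting off the lower tail pushes the mean above 8, so the bounded variable is not just "Normal(8, 2) with a floor".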
I am working on a multi-label classification problem. I used the GaussianNB classifier from scikit-learn in Python. The target is an array with (N, L) shape, where L is the number of classes and N is the number of observations.
I used three ways to deal with multi-label case:
binary relevance
chain model
label powerset
I have a prior distribution for L classes, which is an array of (L,) shape. I tried to incorporate this prior distribution into GaussianNB through priors parameter like this
classifier = BinaryRelevance(GaussianNB(priors = prior_dist))
However, it returns the following error
ValueError: Number of priors must match number of classes.
What is the correct way to specify priors into GaussianNB in a multi-label case?
I haven't added support for this yet in scikit-multilearn, but it seems fairly easy to add. Could you file it as a feature request for scikit-multilearn? I think I have an idea of how to add this, and we can track the issue further on GitHub.
I built a PyMC3 model using the DensityDist distribution. I have four parameters, of which three use Metropolis and one uses NUTS (this is chosen automatically by PyMC3). However, I get two different UserWarnings:
1.Chain 0 contains number of diverging samples after tuning. If increasing target_accept does not help try to reparameterize.
May I know what reparameterize means here?
2. The acceptance probability in chain 0 does not match the target. It is , but should be close to 0.8. Try to increase the number of tuning steps.
Digging through a few examples I used 'random_seed', 'discard_tuned_samples', 'step = pm.NUTS(target_accept=0.95)' and so on and got rid of these user warnings. But I couldn't find details of how these parameter values are being decided. I am sure this might have been discussed in various context but I am unable to find solid documentation for this. I was doing a trial and error method as below.
with patten_study:
    # SEED = 61290425  # 51290425
    step = pm.NUTS(target_accept=0.95)
    trace = pm.sample(step=step)  # 4000, tune=10000, step=step, discard_tuned_samples=False)  # , random_seed=SEED)
I need to run these on different datasets. Hence I am struggling to fix these parameter values for each dataset I am using. Is there any way where I give these values or find the outcome (if there are any user warnings and then try other values) and run it in a loop?
Pardon me if I am asking something stupid!
In this context, reparameterization basically means finding a different but equivalent model that is easier to compute. There are many things you can do, depending on the details of your model:
Instead of using a Uniform distribution you can use a Normal distribution with a large variance.
Changing from a centered hierarchical model to a non-centered one.
Replacing a Gaussian with a Student-T
Model a discrete variable as a continuous
Marginalize variables like in this example
Whether these changes make sense or not is something you should decide, based on your knowledge of the model and the problem.
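As a rough illustration of the centered versus non-centered idea from the list above, here is a standard-library sketch (not PyMC3 code). The values of mu and sigma are hypothetical. It only shows that drawing a standard normal offset and then shifting and scaling it gives the same distribution as drawing from Normal(mu, sigma) directly. In a PyMC3 model, the equivalent change would be to sample the offset and define the original parameter as a deterministic function of it.

```python
import random
import statistics

rng = random.Random(1)
mu, sigma = 3.0, 0.5   # hypothetical group-level mean and scale

# Centered: draw the parameter directly from Normal(mu, sigma).
centered = [rng.gauss(mu, sigma) for _ in range(20000)]

# Non-centered: draw a standard normal offset, then shift and scale it.
offsets = [rng.gauss(0.0, 1.0) for _ in range(20000)]
non_centered = [mu + sigma * z for z in offsets]

print(statistics.mean(centered), statistics.stdev(centered))
print(statistics.mean(non_centered), statistics.stdev(non_centered))
```

Both versions describe the same model. The difference is that the sampler explores the offsets, whose scale does not depend on sigma, and that often reduces divergences in hierarchical models.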
I am doing some PCA using sklearn.decomposition.PCA. I found that if the input matrix X is big, the results of PCA.transform from two different PCA instances will not be the same. For example, when X is a 100x200 matrix, there is no problem. When X is a 1000x200 or a 100x2000 matrix, the results of two different PCA instances will differ. I am not sure what causes this: I assumed there are no random elements in sklearn's PCA solver? I am using sklearn version 0.18.1 with Python 2.7.
The script below illustrates the issue.
import numpy as np
import sklearn.linear_model as sklin
from sklearn.decomposition import PCA
n_sample,n_feature = 100,200
X = np.random.rand(n_sample,n_feature)
pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)
pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)
print(np.sum(X_transformed_1 == X_transformed_2) )
print(np.mean((X_transformed_1 - X_transformed_2)**2) )
PCA has an svd_solver parameter, and its default value is "auto". Depending on the size of the input data, it chooses the most efficient solver.
In your case, when the data is larger than 500, it will choose the randomized solver.
svd_solver : string {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
auto :
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient ‘randomized’ method is
enabled. Otherwise the exact full SVD is computed and optionally
truncated afterwards.
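My reading of that policy can be written as a small hypothetical helper (choose_svd_solver is not a scikit-learn function, and the exact rule may differ between versions). I am assuming that "larger than 500x500" means the largest dimension exceeds 500, which is consistent with the shapes reported in the question.

```python
def choose_svd_solver(shape, n_components):
    """Hypothetical helper mirroring the documented 'auto' policy."""
    if max(shape) <= 500:
        return "full"                      # small problem: exact SVD
    if 1 <= n_components < 0.8 * min(shape):
        return "randomized"                # large problem, few components
    return "full"

for shape in [(100, 200), (1000, 200), (100, 2000)]:
    print(shape, choose_svd_solver(shape, 10))
# (100, 200) full
# (1000, 200) randomized
# (100, 2000) randomized
```

Under that reading, only the 100x200 case uses the exact solver, which would explain why it is the only one that gives identical results.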
To control how the randomized solver behaves, you can set random_state param in PCA which will control the random number generator.
Try using
pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
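To show where the randomness comes from and why fixing the seed removes it, here is a simplified NumPy sketch of a randomized projection. It is not scikit-learn's actual solver. The function name, the oversampling of 10 extra columns, and the seeds are all assumptions made for illustration.

```python
import numpy as np

def randomized_pca_transform(X, n_components, seed):
    """Project X onto approximate principal components via a random sketch."""
    rng = np.random.RandomState(seed)
    Xc = X - X.mean(axis=0)                       # center the data, as PCA does
    # Random test matrix: this is where the randomness enters.
    omega = rng.normal(size=(Xc.shape[1], n_components + 10))
    Q, _ = np.linalg.qr(Xc @ omega)               # orthonormal basis of the sketched range
    # Exact SVD of the much smaller projected matrix.
    _, _, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

X = np.random.RandomState(0).rand(1000, 200)
same_a = randomized_pca_transform(X, 10, seed=42)
same_b = randomized_pca_transform(X, 10, seed=42)
other = randomized_pca_transform(X, 10, seed=7)

print(np.array_equal(same_a, same_b))   # same seed: identical projections
print(np.array_equal(same_a, other))    # different seed: projections differ
```

With uniform random data the spectrum is nearly flat, so the approximate components depend strongly on the random test matrix. Real data with a few dominant components is usually much less sensitive.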
I had a similar problem: even with the same trial number, I was getting different results on different machines. Setting svd_solver to 'arpack' solved the problem.
I'm no king in python, and recently got in trouble with a modification I made in my code. My algorithm is basically multiple uses of stochastic gradient algorithm and thus needs random variables.
I wanted my code to handle custom random variables and probability distribution. To do so, I modified my code and now I use scipy.stats to draw samples of custom random variables. Basically, I create a random variable with an imposed probability density or a cumulative density, and then draw samples thanks to the inverse function of the cumulative distribution function and some uniform random variable between [0,1].
To make it simple the algorithm runs multiple optimization from different starting point using stochastic gradient algorithm, and thus can be parallelized since the starting points are independent.
The problem is that a random variable created this way can't be pickled:
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
I don't understand the subtleties of pickling problems yet, so I would appreciate help with the following simple illustration of the problem:
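import multiprocessing
from functools import partial
import numpy as np
import scipy.stats

RV = scipy.stats.norm()

def Draw(rv, N):
    # Inverse-CDF sampling: push N uniform variates through the ppf.
    return rv.ppf(np.random.random(N))

pDraw = partial(Draw, RV)
PM = multiprocessing.Pool(processes=2)
L = PM.map(pDraw, range(1, 5))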
I've heard of the pathos library, which does not use the same serialization algorithm (it uses dill), but I would like to avoid this solution (if it is a solution) as it is not included in my Python distribution at work... getting it installed will take a lot of time.
I'm writing some tests for a C++ command line Linux app. I'd like to generate a bunch of integers with a power-law/long-tail distribution. Meaning, I get some numbers very frequently but most of them relatively infrequently.
Ideally there would just be some magic equations I could use with rand() or one of the stdlib random functions. If not, an easy to use chunk of C/C++ would be great.
Thanks!
This page at Wolfram MathWorld discusses how to get a power-law distribution from a uniform distribution (which is what most random number generators provide).
The short answer (derivation at the above link):
x = [(x1^(n+1) - x0^(n+1))*y + x0^(n+1)]^(1/(n+1))
where y is a uniform variate, n is the distribution power, x0 and x1 define the range of the distribution, and x is your power-law distributed variate.
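The formula is short enough to try directly. Below is a minimal Python sketch of it (the original question is about C/C++, but the arithmetic carries over unchanged). The function name and the values x0 = 1, x1 = 1000, n = -2.5 are example choices of mine, and truncating to integers is just one way to get the integer test data the question asks for.

```python
import random

def power_law_variate(x0, x1, n, y):
    """Map a uniform variate y in [0, 1] to a power-law variate on [x0, x1]."""
    e = n + 1.0            # the formula is undefined for n == -1
    return ((x1 ** e - x0 ** e) * y + x0 ** e) ** (1.0 / e)

rng = random.Random(0)
# x0, x1 and n are example values only.
samples = [power_law_variate(1.0, 1000.0, -2.5, rng.random()) for _ in range(100000)]
ints = [int(x) for x in samples]     # truncate to integers for test input

share_of_ones = sum(1 for k in ints if k == 1) / len(ints)
print(share_of_ones)                 # most of the mass sits at the low end
print(max(ints))                     # while a few values are much larger
```

With these example parameters well over half of the integers come out as 1, and the rest form the long tail.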
If you know the distribution you want (called the Probability Density Function (PDF)) and have it properly normalized, you can integrate it to get the Cumulative Distribution Function (CDF), then invert the CDF (if possible) to get the transformation you need from a uniform [0,1] distribution to your desired one.
So you start by defining the distribution you want.
P = F(x)
(for x in [0,1]) then integrated to give
C(y) = \int_0^y F(x) dx
If this can be inverted you get
y = C^{-1}(u)
where u is a uniform variate on [0,1]. So call rand(), scale the result to [0,1] to get u, plug it into the last line, and use y.
This result is called the Fundamental Theorem of Sampling (the technique is more commonly known as inverse transform sampling). This is a hassle because of the normalization requirement and the need to analytically invert the function.
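As a worked example of these steps, here is how I believe the MathWorld formula quoted earlier in this thread falls out for a power law on [x_0, x_1]. It uses that formula's notation (y is the uniform variate, x the power-law variate) and assumes n is not -1 and x_0 > 0.

```latex
p(x) = A\,x^{n}, \quad x_0 \le x \le x_1, \qquad A = \frac{n+1}{x_1^{n+1} - x_0^{n+1}}

C(x) = \int_{x_0}^{x} A\,t^{n}\,dt = \frac{x^{n+1} - x_0^{n+1}}{x_1^{n+1} - x_0^{n+1}}

C(x) = y \;\Longrightarrow\; x = \left[\left(x_1^{n+1} - x_0^{n+1}\right) y + x_0^{n+1}\right]^{1/(n+1)}
```

The first line normalizes the PDF, the second integrates it to get the CDF, and the third sets the CDF equal to the uniform variate and solves for x.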
Alternatively you can use a rejection technique: throw a number uniformly in the desired range, then throw another number and compare it to the PDF at the location indicated by your first throw. Reject if the second throw exceeds the PDF. This tends to be inefficient for PDFs with a lot of low-probability region, like those with long tails...
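A minimal Python sketch of the rejection technique, under some assumptions of mine: the target density is (x+1)^-3 on [0,1], its maximum on that range is 1 (at x = 0), and the function names are made up. One point in its favor is that the density does not have to be normalized, only bounded.

```python
import random

def rejection_sample(pdf, lo, hi, pdf_max, rng):
    """Draw one value on [lo, hi] from an unnormalized density bounded by pdf_max."""
    while True:
        x = rng.uniform(lo, hi)          # first throw: candidate location
        h = rng.uniform(0.0, pdf_max)    # second throw: height to compare
        if h <= pdf(x):                  # keep the candidate only if it falls under the curve
            return x

def shifted_power(x, n=3.0):
    """Example density (x + 1)^-n, not normalized."""
    return (x + 1.0) ** -n

rng = random.Random(0)
draws = [rejection_sample(shifted_power, 0.0, 1.0, 1.0, rng) for _ in range(20000)]
print(sum(draws) / len(draws))           # sample mean; the exact mean of this density is 1/3
```

The fraction of accepted throws equals the area under the density divided by the area of the bounding box, which is why long tails make this approach wasteful.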
An intermediate approach involves inverting the CDF by brute force: you store the CDF as a lookup table, and do a reverse lookup to get the result.
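A rough Python sketch of the lookup-table idea, again using (x+1)^-3 on [0,1] as an example density. The function names, the 1000 bins, and the linear interpolation inside a bin are my choices, not something prescribed above.

```python
import bisect
import random

def build_cdf_table(pdf, lo, hi, bins=1000):
    """Tabulate a normalized CDF of pdf on [lo, hi] using the midpoint rule."""
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    cdf = [0.0]
    for i in range(bins):
        mid = 0.5 * (edges[i] + edges[i + 1])
        cdf.append(cdf[-1] + pdf(mid) * width)
    total = cdf[-1]
    return edges, [c / total for c in cdf]   # dividing by total normalizes numerically

def sample_from_table(edges, cdf, u):
    """Reverse lookup: find the bin whose CDF range contains u, then interpolate."""
    i = bisect.bisect_right(cdf, u) - 1
    i = min(max(i, 0), len(edges) - 2)       # guard against u == 1.0
    span = cdf[i + 1] - cdf[i]
    frac = (u - cdf[i]) / span if span > 0 else 0.0
    return edges[i] + frac * (edges[i + 1] - edges[i])

edges, cdf = build_cdf_table(lambda x: (x + 1.0) ** -3.0, 0.0, 1.0)
rng = random.Random(0)
draws = [sample_from_table(edges, cdf, rng.random()) for _ in range(20000)]
print(sum(draws) / len(draws))               # sample mean; the exact mean of this density is 1/3
```

The table is built once, and each draw then costs one binary search, so neither analytic normalization nor an analytic inverse is needed.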
The real stinker here is that simple x^-n distributions are non-normalizable on the range [0,1], so you can't use the sampling theorem. Try (x+1)^-n instead...
I just wanted to carry out an actual simulation as a complement to the (rightfully) accepted answer. Although in R, the code is so simple as to be (pseudo)-pseudo-code.
One tiny difference between the Wolfram MathWorld formula in the accepted answer and other, perhaps more common, equations is the fact that the power law exponent n (which is typically denoted as alpha) does not carry an explicit negative sign. So the chosen alpha value has to be negative, typically with a magnitude between 2 and 3.
x0 and x1 stand for the lower and upper limits of the distribution.
So here it is:
set.seed(0)
x1 = 5 # Maximum value
x0 = 0.1 # It can't be zero; otherwise x0^(negative power) is 1/0.
alpha = -2.5 # It has to be negative.
y = runif(1e7) # Number of samples
x = ((x1^(alpha+1) - x0^(alpha+1))*y + x0^(alpha+1))^(1/(alpha+1))
plot(density(x), ylab="log density x", col=2)
or plotted in logarithmic scale:
plot(density(x), log="xy", ylab="log density x", col=2)
Here is the summary of the data:
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1000  0.1208  0.1584  0.2590  0.2511  4.9388
I can't comment on the math required to produce a power law distribution (the other posts have suggestions) but I would suggest you familiarize yourself with the TR1 C++ Standard Library random number facilities in <random>. These provide more functionality than std::rand and std::srand. The new system specifies a modular API for generators, engines and distributions and supplies a bunch of presets.
The included distribution presets are:
uniform_int
bernoulli_distribution
geometric_distribution
poisson_distribution
binomial_distribution
uniform_real
exponential_distribution
normal_distribution
gamma_distribution
When you define your power law distribution, you should be able to plug it in with existing generators and engines. The book The C++ Standard Library Extensions by Pete Becker has a great chapter on <random>.
Here is an article about how to create other distributions (with examples for Cauchy, Chi-squared, Student t and Snedecor F)