I have a basic understanding of using numpy to create a matrix, but the context in which I have to create one confuses me. For example, I need to create a 2x1000 matrix with normally distributed values with mean 0 and standard deviation 1. I'm not sure what it means to make a matrix with these conditions.
Besides what was written above by CoDEmanX, from the numpy.random.normal documentation we can read about the generic normal distribution in numpy:
numpy.random.normal(loc=0.0, scale=1.0, size=None)
Where:
loc is the mean of the distribution and scale is the standard deviation (square root of the variance).
import numpy as np
A = np.random.normal(loc=0, scale=1, size=(2, 1000))
print(A)
But if these examples are confusing, then consider that
np.random.normal()
simply gives you a random number, and you can create your own custom matrix:
import numpy as np
A = [ [np.random.normal() for i in range(1000)] for j in range(2) ]
A = np.array(A)
If you check the numpy docs for utility functions that could help you reach your goal, you'll come across a standard normal distribution function:
numpy.random.standard_normal(size=None)
Returns samples from a Standard Normal distribution (mean=0, stdev=1).
The mean is 0, the standard deviation is 1.
import numpy
arr = numpy.random.standard_normal((2, 1000))
print(arr.mean()) # -0.027...
print(arr.std()) # 1.0272...
Note that it's not exactly 0 or 1.
I still recommend reading about the normal distribution and standard deviation / variance, even though numpy offers a simple solution.
I am currently learning Python Gekko and I am very new to linear programming, so excuse my ignorance on certain topics.
I have a variable which should have a value of either 0 or a value greater than 20.
I later learnt that this is called a semi-continuous variable. My questions are below:
Is it possible to convert the above condition into a linear equation?
Does Gekko by any chance support semi-continuous variables? I could not find anything about them in the documentation.
You can use the if3() function to enforce that constraint. That function uses a binary variable for the switch condition so it transforms the problem from a linear programming (LP) problem to a mixed integer linear programming (MILP) problem.
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
m = GEKKO()
p = m.Param(np.linspace(0,50))   # sweep the input from 0 to 50
y = m.if3(p-20, 0, p)            # y = 0 when p < 20, y = p when p >= 20
m.options.IMODE = 2              # solve the model over all parameter values
m.solve()
# plot solution
plt.plot(p.value,'r-',lw=3)
plt.plot(y.value,'b.-')
plt.show()
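For the first question, the semi-continuous condition itself can be written as two linear constraints with one binary variable b: 20*b <= x <= M*b, where M is an upper bound on x. Below is a minimal Gekko sketch of that formulation; the bound M=1000 and the placeholder objective are assumptions for illustration, not part of the original problem.
from gekko import GEKKO
m = GEKKO(remote=False)
x = m.Var(lb=0, ub=1000)             # assumed upper bound M = 1000 on x
b = m.Var(lb=0, ub=1, integer=True)  # binary switch: b=0 -> x=0, b=1 -> x>=20
m.Equation(x >= 20*b)
m.Equation(x <= 1000*b)              # with b=0 this forces x to 0
m.Minimize((x-30)**2)                # placeholder objective for illustration
m.options.SOLVER = 1                 # APOPT handles the integer variable
m.solve(disp=False)
print(x.value[0], b.value[0])        # expected: x = 30, b = 1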
I am doing some PCA using sklearn.decomposition.PCA. I found that if the input matrix X is big, the results of PCA.transform from two different PCA instances will not be the same. For example, when X is a 100x200 matrix, there is no problem. When X is a 1000x200 or a 100x2000 matrix, the results of two different PCA instances will be different. I am not sure what causes this: I assumed there were no random elements in sklearn's PCA solver? I am using sklearn version 0.18.1 with Python 2.7.
The script below illustrates the issue.
import numpy as np
import sklearn.linear_model as sklin
from sklearn.decomposition import PCA
n_sample,n_feature = 100,200
X = np.random.rand(n_sample,n_feature)
pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)
pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)
print(np.sum(X_transformed_1 == X_transformed_2))
print(np.mean((X_transformed_1 - X_transformed_2)**2))
There's an svd_solver param in PCA, and by default it has the value "auto". Depending on the input data size, it chooses the most efficient solver.
Now as for your case: when the size is larger than 500, it will choose randomized.
svd_solver : string {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
auto :
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
To control how the randomized solver behaves, you can set the random_state param in PCA, which will control the random number generator.
Try using
pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
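As a quick check (a sketch; the seed 0 stands in for SOME_INT), fixing random_state should make the two transforms agree even on the large input from the question:
import numpy as np
from sklearn.decomposition import PCA
X = np.random.rand(1000, 200)          # large enough that 'auto' picks the randomized solver
pca_1 = PCA(n_components=10, random_state=0)
pca_2 = PCA(n_components=10, random_state=0)
X1 = pca_1.fit_transform(X)
X2 = pca_2.fit_transform(X)
print(np.allclose(X1, X2))             # True: both instances use the same RNG seed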
I had a similar problem: even with the same trial number, I was getting different results on different machines. Setting the svd_solver to 'arpack' solved the problem.
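In code, that variant would look something like this (a sketch; svd_solver='full' is another option, which computes the exact SVD and avoids the randomized path entirely):
pca_1 = PCA(n_components=10, svd_solver='arpack')   # or svd_solver='full'
pca_2 = PCA(n_components=10, svd_solver='arpack')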
I'm no expert in Python, and I recently got into trouble with a modification I made to my code. My algorithm is basically multiple runs of a stochastic gradient algorithm and thus needs random variables.
I wanted my code to handle custom random variables and probability distributions. To do so, I modified my code and now use scipy.stats to draw samples of custom random variables. Basically, I create a random variable with an imposed probability density or cumulative density, and then draw samples via the inverse of the cumulative distribution function applied to a uniform random variable on [0,1].
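For context, here is a minimal sketch of that inverse-CDF (inverse transform) sampling idea with scipy.stats; the exponential distribution is only a stand-in for the custom distribution:
import numpy as np
import scipy.stats
rv = scipy.stats.expon(scale=2.0)   # stand-in for the custom frozen random variable
# Inverse transform sampling: push uniform [0,1] samples through the
# inverse CDF (the percent point function, ppf).
u = np.random.random(1000)
samples = rv.ppf(u)
print(samples.mean())               # should be close to 2.0 for this stand-in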
To keep it simple: the algorithm runs multiple optimizations from different starting points using a stochastic gradient algorithm, and thus can be parallelized since the starting points are independent.
The problem is that a random variable created this way can't be pickled:
PicklingError: Can't pickle : attribute lookup builtin.instancemethod failed
I don't get the subtleties of pickling problems yet, so I would appreciate help with the following simple illustration of the problem:
import numpy as np
import scipy.stats
import multiprocessing
from functools import partial
RV = scipy.stats.norm()
def Draw(rv, N):
    return rv.ppf(np.random.random(N))
pDraw = partial(Draw, RV)
PM = multiprocessing.Pool(processes=2)
L = PM.map(pDraw, range(1, 5))
I've heard of the pathos library, which does not use the same serialization library (it uses dill), but I would like to avoid this solution (if it is one) as it is not included in my Python distribution at work... getting it installed will take a lot of time.
I referred to the following blog post while writing the code below: https://prateekvjoshi.com/2015/12/15/how-to-compute-confidence-measure-for-svm-classifiers/ and I obtained the following results. My intention is to find out the distance of a point from the 3 classes in scikit-learn's SVC, but I am confused by the meaning of the output. Are there any solutions?
import numpy as np
from sklearn.svm import SVC
x = np.array([[1,2],[2,3],[3,4],[1,4],[1,5],[2,4],[2,6]])
y = np.array([0,1,-1,-1,1,1,0])
classifier = SVC(kernel='linear')
classifier.fit(x,y)
classifier.decision_function([[2,1]])
The last call gives the following output, an array of size 3:
array([[ -8.88178420e-16, -1.40000000e+00, -1.00000000e+00]])
What does this array mean, and how can we use it to find out which of the three classes (-1, 1, 0) the particular data point belongs to?
It is the distance of the point [2,1] from the separating hyperplanes of the SVM classifier. So the first value is the distance of [2,1] from the hyperplane separating the first class, and so on. You can see the function's implementation here and read the documentation here for more info.
EDIT: You can also check out this example.
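If the goal is ultimately the class label rather than the raw distances, here is a small sketch continuing the question's code (the decision_function_shape='ovr' setting below is an assumption, used to make the column-to-class mapping explicit; the default shape depends on the scikit-learn version):
point = [[2, 1]]
print(classifier.predict(point))     # the label SVC assigns to the point
# With the one-vs-rest shape, each column of decision_function corresponds
# to one entry of classes_, so the largest value picks the class.
clf_ovr = SVC(kernel='linear', decision_function_shape='ovr').fit(x, y)
scores = clf_ovr.decision_function(point)
print(clf_ovr.classes_[np.argmax(scores)])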
I have two different xarray datasets with different latitude/longitude grid resolutions. I want to regrid the one with lower resolution to the same resolution as the one with higher resolution. I found some examples (e.g., http://earthpy.org/interpolation_between_grids_with_basemap.html), but they do not work for me. Here is one example that I made for testing:
import numpy as np
import xarray as xray
import mpl_toolkits.basemap
var1=xray.DataArray(np.random.randn(len(np.linspace(40.5,49.5,10)),len(np.linspace(-39.5,-20.5,20))),coords=[np.linspace(40.5,49.5,10), np.linspace(-39.5,-20.5,20)],dims=['lat','lon'])
(xlon, xlat)=np.meshgrid(np.linspace(-39.875,-20.125,80),np.linspace(40.125,49.875,40))
var2=xray.DataArray(-xlon**2+xlat**2,coords=[np.linspace(40.125,49.875,40),np.linspace(-39.875,-20.125,80)],dims=['lat','lon'])
mpl_toolkits.basemap.interp(var1,var1.lon,var1.lat,var2.lon,var2.lat,checkbounds=False,masked=False,order=0)
I get following error:
ValueError: xout and yout must have same shape!
Does basemap.interp() require xout and yout to be the same shape? So var2 would need to be square? This is almost never the case with any of my datasets! How can I regrid var1 to the same resolution as var2?
Note: After regridding, I want to subsample var1 given some condition related to var2. For example:
var1_subset = var1.where(var2>1000)
So I want to minimize any loss of grid points during the interpolation.
basemap.interp will only work when xout and yout have the same shape, i.e. when the number of output nlons and nlats are the same.
Why not generate output nlats and nlons of the same length and subset them later?
For example:
import numpy as np
import xarray as xray
import mpl_toolkits.basemap
var1=xray.DataArray(np.random.randn(len(np.linspace(40.5,49.5,10)),len(np.linspace(-39.5,-20.5,20))),coords=[np.linspace(40.5,49.5,10), np.linspace(-39.5,-20.5,20)],dims=['lat','lon'])
(xlon,xlat)=np.meshgrid(np.linspace(-39.875,-20.125,80),np.linspace(40.125,49.875,80))
var2=xray.DataArray(-xlon**2+xlat**2,coords=[np.linspace(40.125,49.875,80),np.linspace(-39.875,-20.125,80)],dims=['lat','lon'])
mpl_toolkits.basemap.interp(var1,var1.lon,var1.lat,var2.lon,var2.lat,checkbounds=False,masked=False,order=0)
Here is another cool trick with xarray.
lonreg=var1.groupby_bins('lon',np.linspace(-39.875,-20.125,80)).mean(dim='lon')
regridded=lonreg.groupby_bins('lat',np.linspace(-39.5,-20.5,20)).mean(dim='lat')
If you want weighted-average regridding, it is easy to extend this to area-averaged regridding by using weights and the sum function on the groupby object, as in the sketch below.
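A rough sketch of that weighted (area-averaged) version, using cos(latitude) as the weight and illustrative bin edges (both choices are assumptions, not from the original code):
import numpy as np
lat_bins = np.linspace(40.5, 49.5, 10)        # illustrative bin edges
weights = np.cos(np.deg2rad(var2.lat))        # area weight ~ cos(latitude)
# Bin the weighted field and the weights separately, then divide.
wsum = (var2 * weights).groupby_bins('lat', lat_bins).sum(dim='lat')
norm = weights.groupby_bins('lat', lat_bins).sum(dim='lat')
area_avg = wsum / norm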