Creating a Nested List using for loops - tuples

This is related to a homework problem for an Inverse Theory Programming class:
We are given a small dataset with x and y values:
x     y
----  ----
0.5   1.2
1.0   1.8
1.5   1.7
2.0   3.0
2.5   3.5
3.0   3.2
3.5   4.5
4.0   4.8
4.5   5.3
5.0   6.2
xdata = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ydata = [1.2, 1.8, 1.7, 3.0, 3.5, 3.2, 4.5, 4.8, 5.3, 6.2]
The ultimate goal is to find the best-fit line for this data using a grid-search technique. The following code produces a grid of m and b values (y = mx + b).
import itertools
import numpy as np

gridm = np.linspace(0.4, 0.8, 10)
gridb = np.linspace(-1, 1, 10)

def makegrid(dim1, dim2):
    return list(itertools.product(dim1, dim2))

grid = makegrid(gridm, gridb)
itertools.product returns a list of (m, b) tuples. I want to take every m and b pair in the grid and calculate a y-value for every x-value in xdata. I use a for loop to iterate over every pair in the grid, but manually index the x-values in xdata:
def linear(m, x, b):
    y = (m * x) + b
    return y

ypred = []
ypred1 = []
ypred2 = []
for i, j in grid:
    ypred1.append(linear(i, xdata[0], j))
    ypred2.append(linear(i, xdata[1], j))
ypred = [ypred1, ypred2]
Instead I would like to iterate over the xdata, which I believe is something like this:
ypred = []
for i, j in grid:
    for k in xdata:
        yb = linear(i, k, j)
        ypred.append(yb)
This returns a single flat list:
[y1,y2,y3,y4,y5,y6,etc]
I would like to create a nested list that contains, for each x-value in xdata, all y-values calculated from the tuples in grid. In other words, I am looking for something like this:
lst = [[m1*x1+b1, m1*x1+b2, m1*x1+b3, ..., m4*x1+b1... ], [m1*x2+b1, m1*x2+b2, m1*x2+b3, ..., m4*x2+b1...], [m1*x3+b1, m1*x3+b2, m1*x3+b3, ..., m4*x3+b1...], etc ]
where,
xdata = [x1,x2,x3,x4,x5,etc]
Thanks.
Note: ypred from the first snippet looks like lst; I would really like to know how to build this iteratively with for loops.
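One way to build the nested structure described above is to swap the loop order: iterate over xdata in the outer loop and over grid in the inner loop. A minimal sketch, reusing the linear() and grid defined earlier:
ypred = []
for k in xdata:            # one sub-list per x value
    row = []
    for i, j in grid:      # one prediction per (m, b) pair
        row.append(linear(i, k, j))
    ypred.append(row)
# ypred[0] holds the predictions for xdata[0], ypred[1] for xdata[1], and so on.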

Related

Derivative in JAX and Sympy not coinciding

For this vector-valued function I want to evaluate the Jacobian:
import jax
import jax.numpy as jnp

def myf(arr, phi_0, phi_1, phi_2, lambda_0, R):
    arr = jnp.deg2rad(arr)
    phi_0 = jnp.deg2rad(phi_0)
    phi_1 = jnp.deg2rad(phi_1)
    phi_2 = jnp.deg2rad(phi_2)
    lambda_0 = jnp.deg2rad(lambda_0)
    n = jnp.sin(phi_1)
    F = 2.0
    rho_0 = 1.0
    rho = R*F*(1/jnp.tan(jnp.pi/4 + arr[1]/2))**n
    x_L = rho*jnp.sin(n*(arr[0] - lambda_0))
    y_L = rho_0 - rho*jnp.cos(n*(arr[0] - lambda_0))
    return jnp.array([x_L, y_L])
arr = jnp.array([-18.1, 29.9])
jax.jacobian(myf)(arr, 29.5, 29.5, 29.5, -17.0, R=1)
I obtain
[[ 0.01312758 0.00014317]
[-0.00012411 0.01514319]]
I'm shocked by these values. Take for instance the element [0][0], 0.01312758. We know it's the partial of x_L with respect to arr[0]. Whether by hand or using sympy, that derivative is ~0.75.
from sympy import *
x, y = symbols('x y')
x_L = (2.0*(1/tan(3.141592/4 + y/2))**0.492)*sin(0.492*(x + 0.2967))
deriv = Derivative(x_L, x)
deriv.doit()
deriv.doit().evalf(subs={x: -0.3159, y: 0.52})
0.752473089673695
(inserting x and y, which are arr[0] and arr[1] already converted to radians). This is also the result I obtain by hand. What is happening with the JAX results? I can't see what I'm doing wrong.
Your JAX snippet inputs degrees, and so its gradient has units of 1/degrees, while your sympy snippet inputs radians, and so its gradient has units of 1/radians. If you convert the jax outputs to 1/radians (i.e. multiply the jax outputs by 180 / pi), you'll get the result you're looking for:
result = jax.jacobian(myf)(arr, 29.5, 29.5, 29.5, -17.0, R=1)
print(result * 180 / jnp.pi)
[[ 0.7521549 0.00820279]
[-0.00711098 0.8676407 ]]
Alternatively, you could rewrite myf to accept inputs in units of radians and get the expected result by taking its gradient directly.
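A related option, sketched below, is to leave the original myf untouched and differentiate a thin wrapper whose input is already in radians, so the unit conversion becomes part of the function being differentiated (myf_rad is just an illustrative name):
import jax
import jax.numpy as jnp

def myf_rad(arr_rad, *args, **kwargs):
    # hand degrees to the original (degree-based) myf
    return myf(jnp.rad2deg(arr_rad), *args, **kwargs)

arr = jnp.array([-18.1, 29.9])   # degrees, as in the question
jac = jax.jacobian(myf_rad)(jnp.deg2rad(arr), 29.5, 29.5, 29.5, -17.0, R=1)
print(jac)  # ~ [[0.752, 0.0082], [-0.0071, 0.868]], now in units of 1/radian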
Ok, I think I know what is happening... it is subtle.
The problem is that the conversion from degrees to radians is done inside the function, so jax.jacobian differentiates with respect to the degree-valued inputs and the result comes out scaled by the conversion factor (the unit issue described in the answer above). The correct code, converting outside the function and passing radians in directly, will be
def myf(arr, phi_0, phi_1, phi_2, lambda_0, R):
    n = jnp.sin(phi_1)
    F = 2.0
    rho_0 = 1.0
    rho = R*F*(1/jnp.tan(jnp.pi/4 + arr[1]/2))**n
    x_L = rho*jnp.sin(n*(arr[0] - lambda_0))
    y_L = rho_0 - rho*jnp.cos(n*(arr[0] - lambda_0))
    return jnp.array([x_L, y_L])

arr = jnp.array([-18.1, 29.9])
jax.jacobian(myf)(jnp.deg2rad(arr), jnp.deg2rad(29.5),
                  jnp.deg2rad(29.5), jnp.deg2rad(29.5), jnp.deg2rad(-17.0),
                  R=1)
# [[ 0.7521549 0.00820279]
# [-0.00711098 0.8676407 ]]

Python curve fitting on a barplot

How do I fit a curve to a bar plot?
I have an equation, the diffusion equation, which has some unknown parameters; these parameters make the curve larger, taller, etc. On the other hand, I have a bar plot coming from a simulation. I would like to fit the curve to the bar plot and find the best parameters for the curve. How can I do that?
This is what I obtained by 'manual fitting': basically I changed all the parameters by hand for hours. However, is there a way to do this with Python?
To make it simple, imagine I have the following code:
import numpy as np
import scipy.special
import matplotlib.pyplot as plt

list1 = []
for i in range(-5, 6):
    list1.append(i)
width = 1/1.5
list2 = [0, 0.2, 0.6, 3.5, 8, 10, 8, 3.5, 0.6, 0.2, 0]
plt.bar(list1, list2, width)
plt.show()

T = 0.13
xx = np.arange(-6, 6, 0.01)
yy = 5*np.sqrt(np.pi)*np.exp(-((xx)**2)/(4*T))*scipy.special.erfc((xx)/(2*np.sqrt(T))) + np.exp(-((xx)**2)/(4*T))
plt.plot(xx, yy)
plt.show()
Clearly the fitting here would be pretty hard, but anyway, is there any function that lets me find the best coefficients for the following equation, where T is known:
y = A*np.sqrt(np.pi*D)*np.exp(-((x-E)**2)/(4*D*T))*scipy.special.erfc((x-E)/(2*np.sqrt(D*T))) + 300*np.exp(-((x-E)**2)/(4*D*T))
EDIT: This is different from the already-asked question and from the scipy documentation example. In the latter, the xdata and ydata describe the same points, while in my case they might not. Furthermore, I would also like to be able to plot the fitted curve, which isn't shown in the documentation. The height of the bars is not a function of the x's! So my ydata is not a function of my xdata; this is different from what is in the documentation.
To see what I mean, change the code in the documentation a little bit so that it matches my example:
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

xdata = np.linspace(0, 4, 50)
y = func(xdata, 2.5, 1.3, 0.5)
ydata = [1, 6, 3, 4, 6, 7, 8, 5, 7, 0, 9, 8, 2, 3, 4, 5]
popt, pcov = curve_fit(func, xdata, ydata)
If you run this, it doesn't work: ydata has 16 elements while xdata has 50. This happens because y takes its values from xdata, while ydata comes from another set of x values, which is unknown here.
Thank you
I stand by my thinking that this question is a duplicate. Here is a brief example of the typical workflow using curve_fit. Let me know if you still think that your situation is different.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
# bar plot data
list1 = range(-5, 6)
list2 = [0, 0.2, 0.6, 3.5, 8, 10,
8, 3.5, 0.6, 0.2, 0]
width = 1/1.5
plt.bar(list1, list2, width, alpha=0.75)
# fit bar plot data using curve_fit
def func(x, a, b, c):
# a Gaussian distribution
return a * np.exp(-(x-b)**2/(2*c**2))
popt, pcov = curve_fit(func, list1, list2)
x = np.linspace(-5, 5, 100)
y = func(x, *popt)
plt.plot(x + width/2, y, c='g')
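As a quick follow-up, the fitted coefficients and their approximate one-sigma uncertainties can be read off popt and pcov (a short sketch, assuming the fit above has converged):
# approximate one-sigma uncertainties from the diagonal of the covariance matrix
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")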

Method for evaluating the unit vector (or normalising a vector) in Python or in the numerical libraries: numpy, scipy [duplicate]

I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm
This function handles the case where the vector v has norm 0.
Are there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np

def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)

A = np.random.randn(3, 3, 3)
print(normalized(A, 0))
print(normalized(A, 1))
print(normalized(A, 2))
print(normalized(np.arange(3)[:, None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but it fails (division by zero) when v has norm 0.
In that case, introducing a small constant to prevent the zero division solves this.
As proposed in the comments, one could also use
v/np.linalg.norm(v)
To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
    # d is a (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d
This uses NumPy's peak-to-peak function (np.ptp).
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0)  # array([1., 1., 1.]): each column sums to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0)  # array([1., 1., 1.]): the max of each column is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
You mentioned scikit-learn, so I want to share another solution.
scikit-learn MinMaxScaler
In scikit-learn, there is an API called MinMaxScaler which lets you customize the value range as you like.
It also deals with NaN issues for us.
NaNs are treated as missing values: disregarded in fit, and maintained
in transform. ... see reference [1]
Code sample
The code is simple, just type
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Let's say X_train is your input dataframe
# call MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional arrays, the following fast solution is possible.
Say we have a 2D array which we want to normalize along the last axis, while some rows have zero norm.
import numpy as np

arr = np.array([
    [1, 2, 3],
    [0, 0, 0],
    [5, 6, 7]
], dtype=float)

lengths = np.linalg.norm(arr, axis=-1)
print(lengths)  # [ 3.74165739  0.         10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
#  [0.         0.         0.        ]
#  [0.47673129 0.57207755 0.66742381]]
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn and using just NumPy, just define a function.
Assuming that the rows are the variables and the columns the samples (axis=1):
import numpy as np
# Example array
X = np.array([[1,2,3],[4,5,6]])
def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)
Output:
X
array([[1, 2, 3],
       [4, 5, 6]])
stdmtx(X)
array([[-1.,  0.,  1.],
       [-1.,  0.,  1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
If you want all values in [0, 1] for a 1d-array, then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
Where a is your 1d-array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
A note on this method: to preserve the proportions between values, the 1d-array must contain at least one 0 and consist only of 0 and positive numbers.
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x

Random Forest - or other Machine Learning - with Different number of features

I am trying to compare a list of numbers with another list of lists to see how many of them match fairly closely. However, each of my data sets could have a different length.
As an example, if I had a list of time spent studying, student 1 might have
1 - [ 10.0, 25.0, 15.7, 45.0]
and be compared against the list of other students that were
2 - [ 9.0, 30.0, 3.0]
3 - [ 26.0, 44.0]
4 - [ 5.0, 70.0, 90.0, 100.0]
5 - [ 9.0, 27.0, 13.7, 42.0, 56.0, 60.0, 75.0]
I would want the comparison to score highly for student 1 vs student 5, because four of the times match closely, even though student 5 had extra times that student 1 didn't have. I would also want it to score fairly well for student 1 vs student 3, because some of the numbers matched closely, even though some did not.
I am just getting started with machine learning and am only passingly familiar with random forests. Can you use them to do this type of comparison, or do they require the same number of features? Can you suggest a different method?
Effectively, what I am looking for is an intersection of sets with loose parameters. I would like to implement this in Python.
Thank you!
Normalization
Start by normalizing the data to the range 0 to 1. This can be done using the following formula:
Norm(e) = (e - Emin) / (Emax - Emin)
for each value e in each vector. (I don't know how to put math symbols in here or I would.)
So for example the first vector would become...
1 - [ 10.0, 25.0, 15.7, 45.0]
Norm(10.0) = (10.0 - 10.0) / (45.0 - 10.0) = 0.0
1 - [ 0.0, 25.0, 15.7, 45.0]
Norm(25.0) = (25.0 - 10.0) / 35.0 = 15/35 = 3/7 ~= 0.42857142
1 - [ 0.0, 0.42857142, 15.7, 45.0]
...
1 - [ 0.0, 0.42857142, 0.16285714, 1.0]
Do this for every vector and then calculate the mean squared error of each pair, padding the shorter vector with 0's where necessary. This should give you a pretty good scoring mechanism. If you need to, you can also split a 1.0 into two 0.5 entries.
Mean squared error
You can calculate the mean squared error using the following equation:
MSE = (1/n) * Σᵢ (Ŷᵢ - Yᵢ)²
where n is the number of elements in each vector and Ŷ, Y are the two vectors for which you want the MSE.
In code, the function would look something like this (using double, since the normalized values are fractions):
public double getMSE(double[] v1, double[] v2) {
    double returnValue = 0.0;
    for (int i = 0; i < v1.length; i++) {
        returnValue += Math.pow(v1[i] - v2[i], 2);
    }
    return returnValue / v1.length;
}
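Since the question asks for a Python implementation, here is a rough sketch of the same normalize-then-MSE idea in NumPy (function names are illustrative; the shorter vector is padded with zeros as described above):
import numpy as np

def min_max_norm(v):
    # scale a list of times into [0, 1] using the formula above
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def similarity_score(a, b):
    # normalize both vectors, pad the shorter with zeros, return the MSE
    a, b = min_max_norm(a), min_max_norm(b)
    n = max(len(a), len(b))
    a = np.pad(a, (0, n - len(a)))
    b = np.pad(b, (0, n - len(b)))
    return np.mean((a - b) ** 2)

student1 = [10.0, 25.0, 15.7, 45.0]
student5 = [9.0, 27.0, 13.7, 42.0, 56.0, 60.0, 75.0]
print(similarity_score(student1, student5))  # lower score = closer match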

Efficient Calculation of Determinant on change for 3X3 matrix?

Let's say you were given a 3x3 matrix of values and only a particular column changes values between iterations.
On Step 1 I have X:
[2.0, 4.6, 3.4]
[3.2, 6.7, 4.1]
[2.1, 1.4, 5.3]
Whose determinant is -11.476
Now on Step 2 I have X, with the second column populated with new values.
[2.0, 6.5, 3.4]
[3.2, 3.4, 4.1]
[2.1, 0.8, 5.3]
Is there a quick way to calculate the determinant of this matrix given the previous state of the matrix and its previous determinant? I want to preserve some of the information known at the previous state. Only columns change on each iteration.
If it's always the same column that changes, you can use Laplace expansion with respect to that column.
If your first and third columns are constant and only the second column varies, then you can turn this into a formula:
det [2.0, a, 3.4]
    [3.2, b, 4.1]
    [2.1, c, 5.3]
= -a * det([3.2, 4.1], [2.1, 5.3]) + b * det([2.0, 3.4], [2.1, 5.3]) - c * det([2.0, 3.4], [3.2, 4.1])
= -8.35*a + 3.46*b + 2.68*c
So now you have a formula that you can use.
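As a sketch of how this could look in NumPy, the three cofactors of the varying column can be precomputed once from the fixed columns; each new determinant is then just a dot product with the new middle column (variable names are illustrative):
import numpy as np

X = np.array([[2.0, 4.6, 3.4],
              [3.2, 6.7, 4.1],
              [2.1, 1.4, 5.3]])

# cofactors of column 2, computed once from the fixed first and third columns
fixed = X[:, [0, 2]]
cof = np.array([
    -np.linalg.det(np.delete(fixed, 0, axis=0)),  # cofactor of X[0, 1]
    +np.linalg.det(np.delete(fixed, 1, axis=0)),  # cofactor of X[1, 1]
    -np.linalg.det(np.delete(fixed, 2, axis=0)),  # cofactor of X[2, 1]
])

print(cof @ np.array([4.6, 6.7, 1.4]))  # ~ -11.476 (Step 1 determinant)
print(cof @ np.array([6.5, 3.4, 0.8]))  # ~ -40.367 (Step 2 determinant)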