Iterating operations over unique values of an array - python-2.7

I have a pandas dataframe that resembles one generated as follows.
import numpy as np
import pandas as pd
x0 = pd.DataFrame(np.random.normal(size=(10, 4)))
x1 = pd.DataFrame({'x': [1,1,2,3,2,3,4,1,2,3]})
df = pd.concat((x0, x1), axis=1)
and a function:
def fun(df, n=100):
    z = np.random.normal(size=n)
    return np.dot(df[[0,1,2,3]], [0.5*z,-1*z,0.3*z,1.2*z])
I would like to:
1. use identical draws z for each unique value in x,
2. take the product of the output from the step above over the rows that share each unique value of x.
Any suggestions?
Explanation:
1. Generate n=100 draws to get z such that len(z) == 100.
2. For each element of z, evaluate the function fun.
3. For each i in df.x.unique(), compute the element-wise product of the outputs from step (2). I expect to get a DataFrame or array of dimension (len(df.x.unique()), n=100).

It sounds like you want to group by 'x', taking one of its instances (let's assume we take the first one observed).
Just call your function as follows:
f = fun(df.groupby('x').first())
>>> f.shape
Out[25]: (4, 100)
>>> len(df.x.unique())
Out[26]: 4
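The call above evaluates fun on one representative row per group. If the intent is instead to reuse a single draw z for every row and then multiply the outputs within each group of x, one possible reading of the question can be sketched as follows. The helper name fun_with_z, the fixed seed, and the names out and prod_by_x are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seed is an assumption).
rng = np.random.RandomState(0)
x0 = pd.DataFrame(rng.normal(size=(10, 4)))
x1 = pd.DataFrame({'x': [1, 1, 2, 3, 2, 3, 4, 1, 2, 3]})
df = pd.concat((x0, x1), axis=1)


def fun_with_z(df, z):
    """Hypothetical variant of fun that takes z as an argument,
    so the same draws can be shared by every row."""
    coeffs = np.array([0.5, -1.0, 0.3, 1.2])
    weights = np.outer(coeffs, z)                    # shape (4, n)
    return np.dot(df[[0, 1, 2, 3]].values, weights)  # shape (rows, n)


z = rng.normal(size=100)                     # one set of draws for all rows
out = pd.DataFrame(fun_with_z(df, z), index=df.index)

# Multiply the per-row outputs element-wise within each value of x.
prod_by_x = out.groupby(df['x']).prod()
print(prod_by_x.shape)
```

Each row of prod_by_x corresponds to one unique value of x, and each column to one draw in z.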

Related

How to handle PySpark UDF return values in different types?

I have a data frame with one column. In each row of this data frame, there is a list. For example :
df = spark.createDataFrame(
    [
        [[13,23]],
        [[55,65]],
    ],
    ['col',]
)
Then I defined a UDF that adds 1 to the first number in the list and 1.5 to the second number.
def calculate(mylist):
    x = mylist[0] + 1
    y = mylist[1] + 1.5
    return x, y
The problem is that when I apply this function to my data frame it returns the X value but it does not return the Y value.
I think it is because the Y value is not an integer.
This is the way that I do this.
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType
func = F.udf(lambda x: calculate(x), ArrayType(IntegerType()))
df.withColumn('vals', func('col')).show()
What can I do to get the Y value as well as X value? I simplified the UDF and sample data frame for the sake of being easy to understand and solve.
With the given input, the calculate UDF returns both an integer and a float.
If both values need to be the same type, you can keep the same code and change calculate so that it returns two integers.
If, in your use case, the first value is an integer and the second is a float, you can return a StructType:
import pyspark.sql.types as T
func = F.udf(lambda x: calculate(x), T.StructType(
    [T.StructField("val1", T.IntegerType(), True),
     T.StructField("val2", T.FloatType(), True)]))

matplotlib subplot not working as expected

I have the following code:
import matplotlib.pyplot as plt
horas = [1,2,3,4]
diccionario = {(1,1,2,1):[2,3,4,5],
(1,2,2,2):[2,5,1,5],
(1,3,2,3):[2,5,5,5],
(1,4,2,4):[2,6,8,5],
(1,5,2,5):[2,7,5,5],
(1,6,2,6):[2,8,2,5],
(1,7,2,7):[2,9,6,5],
(1,8,2,8):[2,4,9,5]}
plt.figure()
i = 1
maximo = 0
keys = diccionario.keys()
for n in range(0,len(keys)-1,2):
    gn, = plt.plot(horas,diccionario[keys[n]],'ro-')
    gn1, = plt.plot(horas,diccionario[keys[n+1]],'g*-')
    plt.subplot(len(keys)//2, 1,i)
    plt.legend([gn,gn1], [keys[n],keys[n+1]])
    i+=1
plt.show()
I expect to have 4 subplots with two lines each. I have them, but the last one is empty.
Could anyone explain why? I have tried many different ways without succeeding.
Call plt.subplot() before you plot gn and gn1. plt.plot() draws on the current axes, so the subplot has to be selected first. That will solve your problem.
for n in range(0, len(keys) - 1, 2):
    plt.subplot(len(keys)//2, 1, i)
    gn, = plt.plot(horas, diccionario[keys[n]], 'ro-')
    gn1, = plt.plot(horas, diccionario[keys[n+1]], 'g*-')
    plt.legend([gn, gn1], [keys[n], keys[n+1]])
    i += 1
By the way, I recommend using a list of tuples instead of a dict. A dict does not guarantee key order in Python 2 (or in Python 3 before 3.7), so you may notice that the pairs are plotted in a different order from the one you wrote.
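As a rough sketch of both suggestions together, the version below sorts the keys so the pairing is deterministic and uses plt.subplots so that each pair is drawn on its own axes object. The Agg backend, the sorted pairing, and the names pairs, fig and axes are assumptions chosen so the example runs without a display; they are not from the original post.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

horas = [1, 2, 3, 4]
diccionario = {(1, 1, 2, 1): [2, 3, 4, 5],
               (1, 2, 2, 2): [2, 5, 1, 5],
               (1, 3, 2, 3): [2, 5, 5, 5],
               (1, 4, 2, 4): [2, 6, 8, 5],
               (1, 5, 2, 5): [2, 7, 5, 5],
               (1, 6, 2, 6): [2, 8, 2, 5],
               (1, 7, 2, 7): [2, 9, 6, 5],
               (1, 8, 2, 8): [2, 4, 9, 5]}

# Sort the keys so the order does not depend on dict internals,
# then pair them up: (key0, key1), (key2, key3), ...
keys = sorted(diccionario)
pairs = list(zip(keys[0::2], keys[1::2]))

# One axes object per pair; plotting through ax avoids any
# dependence on which subplot is "current".
fig, axes = plt.subplots(len(pairs), 1, sharex=True)
for ax, (k1, k2) in zip(axes, pairs):
    ax.plot(horas, diccionario[k1], 'ro-', label=str(k1))
    ax.plot(horas, diccionario[k2], 'g*-', label=str(k2))
    ax.legend()

# In an interactive session, switch to a GUI backend and call plt.show().
```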

Lmfit separate peak fitting

I'm very new to curve/peak fitting, but I am trying to fit a data set with multiple separate independent peaks. I've tried something similar to the example provided by lmfit, and here's my code:
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel
from numpy import loadtxt
data = loadtxt('079-55.freq')
x = data[:, 0]
y = data[:, 1]
gauss1 = GaussianModel(prefix='g1_')
pars = gauss1.make_params()
pars['g1_center'].set(4100, min=2000, max=4500)
pars['g1_amplitude'].set(170, min=10)
gauss2 = GaussianModel(prefix='g2_')
pars.update(gauss2.make_params())
pars['g2_center'].set(4900, min=4500, max=5500)
pars['g2_amplitude'].set(30, min=10)
gauss3 = GaussianModel(prefix='g3_')
pars.update(gauss3.make_params())
pars['g3_center'].set(600, min=5500, max=10000)
pars['g3_amplitude'].set(13, min=10)
mod = gauss1 + gauss2 + gauss3
init = mod.eval(pars, x=x)
plt.plot(x, init, 'k--')
out = mod.fit(y, pars, x=x)
print(out.fit_report())
plt.plot(x, out.best_fit, 'r-')
plt.plot(x, y)
plt.show()
However, the result becomes something like this:
I am very confused about how to proceed to fit three separate peaks as shown below. I think the parameter update is for fitting multiple models to the same data set, not for separate independent peaks. I could be wrong, though. Are there any suggestions?
pars['g3_center'].set(600, min=5500, max=10000)
This line probably confuses the parameter or the model class, because the initial value 600 is not within the bounds given by min and max.

Rolling Window on Dataframe, multiple columns input and output

I have a function myfunc, which does calculations on two pandas DataFrame columns. Output is a Numpy array.
def myfunc(df, args):
    import numpy
    return numpy.array([df.iloc[:,args[0]].sum,df.iloc[:,args[1]].sum])
This function is called within rolling_df_apply:
def rolling_df_apply(df, myfunc, window, *args):
    import pandas
    result = pandas.concat(pandas.DataFrame(myfunc(df.iloc[i:window+i],args), index=[df.index[i+window-1]]) for i in xrange(0,len(df)-window+1))
    return result
Running this via
import numpy
import pandas
df=pandas.DataFrame(numpy.random.randint(5,size=(5,2)))
window=3
args = [0,1]
result = rolling_df_apply(df, myfunc, window, *args)
gives ValueError within pandas.concat(): Shape of passed values is (1, 2), indices imply (1, 1).
What must be changed to get this running?
Which indices imply shape 1,1? Shape of all dataframes to concatenate should be 1,2, though.
In myfunc, .sum should be .sum().
Since myfunc returns an array of length 2,
pandas.DataFrame(myfunc(df.iloc[i:window+i],args), index=[df.index[i+window-1]])
is essentially the same as
pd.DataFrame([0,1], index=[0])
which raises
ValueError: Shape of passed values is (1, 2), indices imply (1, 1)
The error is saying that the value [0,1] implies 1 row and 2 columns,
while the index implies 1 row and 1 column.
One way to fix this would be to pass a dict instead of a list:
In [191]: pd.DataFrame({'a':0,'b':1}, index=[0])
Out[191]:
   a  b
0  0  1
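Another option, sketched here as an alternative to the dict, is to keep the array but make it explicitly one row with two columns, either by wrapping it in a list or by reshaping it. The name values is a stand-in for the length-2 array returned by myfunc, and the resulting columns are labelled 0 and 1 rather than 'a' and 'b'.

```python
import numpy as np
import pandas as pd

# Stand-in for the length-2 array that myfunc returns (an assumption).
values = np.array([0, 1])

# Wrapping the 1-D array in a list, or reshaping it to (1, 2), makes the
# data explicitly one row with two columns, which matches a one-element index.
row_from_list = pd.DataFrame([values], index=[0])
row_from_reshape = pd.DataFrame(values.reshape(1, -1), index=[0])

print(row_from_list.shape)
print(row_from_reshape.shape)
```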
So, to fix your code with minimal changes,
import pandas as pd
import numpy as np
def myfunc(df, args):
    return {'a':df.iloc[:,args[0]].sum(), 'b':df.iloc[:,args[1]].sum()}
def rolling_df_apply(df, myfunc, window, *args):
    frames = [pd.DataFrame(myfunc(df.iloc[i:window+i],args),
                           index=[df.index[i+window-1]])
              for i in xrange(0,len(df)-window+1)]
    result = pd.concat(frames)
    return result
np.random.seed(2015)
df = pd.DataFrame(np.random.randint(5,size=(5,2)))
window=3
args = [0,1]
result = rolling_df_apply(df, myfunc, window, *args)
print(result)
yields
   a  b
2  7  6
3  7  5
4  3  3
However, it would be much more efficient to replace myfunc and rolling_df_apply with a call to pd.rolling_sum:
result = pd.rolling_sum(df, window=3).dropna(axis=0)
which yields the same sums.
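Note that pd.rolling_sum belongs to older pandas releases; newer versions expose the same operation through the DataFrame.rolling method. The sketch below, written for Python 3 and a recent pandas, computes the rolling sum that way and compares it against an explicit loop over windows. The name manual is an assumption introduced only for the comparison.

```python
import numpy as np
import pandas as pd

np.random.seed(2015)
df = pd.DataFrame(np.random.randint(5, size=(5, 2)))
window = 3

# Rolling sum over each column; the first window-1 rows are NaN and are dropped.
result = df.rolling(window=window).sum().dropna(axis=0)

# Explicit loop over the same windows, for comparison only.
manual = pd.DataFrame(
    np.array([df.iloc[i:i + window].sum().values
              for i in range(len(df) - window + 1)]),
    index=df.index[window - 1:],
    columns=df.columns,
)

print(result)
print(np.array_equal(result.values, manual.values))
```

The rolling version returns floats because of the intermediate NaN rows, while the loop keeps integers; the values themselves agree.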

Returning np.array or np.matrix objects in a theano function

I have to do something like this.
import numpy as np
import theano as th
import theano.tensor as T
x, y = T.dscalars('x', 'y')
z = np.matrix([[x*y, x-y], [x/y, x**2/(2*y)]])
f = th.function([x, y], z) # causes error
# next comes calculations like f(2, 1)*f(3, 2)*some_matrix
I know the last line is not valid code, as th.function doesn't support returning these objects. Is there an efficient way to do this without returning all the elements of the matrix and casting them to an np.matrix?
The problem with your approach is that z needs to be a list of theano variables, not a numpy matrix.
You can achieve the same result using:
z1,z2,z3,z4 = x*y,x-y,x/y,x**2/(2*y)
f = th.function([x, y], [z1,z2,z3,z4])
def createz(z1,z2,z3,z4):
    return np.matrix([[z1,z2],[z3,z4]])
print(createz(*f(1,2)))