I have a function myfunc, which does calculations on two pandas DataFrame columns. Output is a Numpy array.
def myfunc(df, args):
    import numpy
    return numpy.array([df.iloc[:,args[0]].sum,df.iloc[:,args[1]].sum])
This function is called within rolling_df_apply:
def rolling_df_apply(df, myfunc, window, *args):
    import pandas
    result = pandas.concat(pandas.DataFrame(myfunc(df.iloc[i:window+i],args), index=[df.index[i+window-1]]) for i in xrange(0,len(df)-window+1))
    return result
Running this via
import numpy
import pandas
df=pandas.DataFrame(numpy.random.randint(5,size=(5,2)))
window=3
args = [0,1]
result = rolling_df_apply(df, myfunc, window, *args)
gives ValueError within pandas.concat(): Shape of passed values is (1, 2), indices imply (1, 1).
What must be changed to get this running?
Which indices imply shape 1,1? Shape of all dataframes to concatenate should be 1,2, though.
In myfunc, .sum should be .sum(): without the parentheses you get the bound method rather than the computed sum.
Since myfunc returns an array of length 2,
pandas.DataFrame(myfunc(df.iloc[i:window+i],args), index=[df.index[i+window-1]])
is essentially the same as
pd.DataFrame([0,1], index=[0])
which raises
ValueError: Shape of passed values is (1, 2), indices imply (1, 1)
The error is saying that the value [0,1] implies 1 row and 2 columns,
while the index implies 1 row and 1 column.
One way to fix this would be to pass a dict instead of a list:
In [191]: pd.DataFrame({'a':0,'b':1}, index=[0])
Out[191]:
   a  b
0  0  1
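Another option, offered here as a sketch rather than as part of the original answer, is to keep returning a NumPy array and reshape it to a single row before building the DataFrame, so that a length-2 array becomes a (1, 2) block. The names values and row_label are illustrative stand-ins.

```python
import numpy as np
import pandas as pd

values = np.array([0, 1])   # stand-in for the array myfunc returns
row_label = 0               # stand-in for df.index[i+window-1]

# reshape(1, -1) turns the 1-D array of length 2 into one row with two columns,
# which matches a single-entry index
frame = pd.DataFrame(values.reshape(1, -1), index=[row_label])
print(frame.shape)  # (1, 2)
```

With this approach the columns get default integer labels (0 and 1) instead of the 'a' and 'b' keys used in the dict version.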
So, to fix your code with minimal changes,
import pandas as pd
import numpy as np

def myfunc(df, args):
    # .sum() (with parentheses) returns the value, not the bound method
    return {'a':df.iloc[:,args[0]].sum(), 'b':df.iloc[:,args[1]].sum()}

def rolling_df_apply(df, myfunc, window, *args):
    # build one single-row DataFrame per window, labelled by the window's last index
    # range works on both Python 2 and 3 (the original used xrange)
    frames = [pd.DataFrame(myfunc(df.iloc[i:window+i],args),
                           index=[df.index[i+window-1]])
              for i in range(0,len(df)-window+1)]
    result = pd.concat(frames)
    return result

np.random.seed(2015)
df = pd.DataFrame(np.random.randint(5,size=(5,2)))
window=3
args = [0,1]
result = rolling_df_apply(df, myfunc, window, *args)
print(result)
yields
   a  b
2  7  6
3  7  5
4  3  3
However, it would be much more efficient to replace myfunc and rolling_df_apply with a call to pd.rolling_sum:
result = pd.rolling_sum(df, window=3).dropna(axis=0)
yields the same result.
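Note that pd.rolling_sum was removed from later pandas releases; the equivalent there is the DataFrame.rolling method. A minimal sketch, assuming a recent pandas, with a hand-rolled check (the name manual is illustrative) that it agrees with the per-window sums:

```python
import numpy as np
import pandas as pd

np.random.seed(2015)
df = pd.DataFrame(np.random.randint(5, size=(5, 2)))
window = 3

# rolling(...).sum() leaves NaN in the first window-1 rows; dropna removes them
rolled = df.rolling(window=window).sum().dropna(axis=0)

# explicit per-window sums, for comparison only
manual = pd.DataFrame(
    np.array([df.iloc[i:i + window].sum().to_numpy()
              for i in range(len(df) - window + 1)]),
    index=df.index[window - 1:],
)
print(np.allclose(rolled.to_numpy(), manual.to_numpy()))
```

The rolled result keeps the original column labels (0 and 1) and holds floats, so it matches the dict-based version in values rather than in column names or dtype.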
I have a data frame with one column. In each row of this data frame, there is a list. For example:
df = spark.createDataFrame(
    [
        [[13,23]],
        [[55,65]],
    ],
    ['col',]
)
Then I defined a UDF which adds 1 to the first number in the list and 1.5 to the second number.
def calculate(mylist):
    x = mylist[0] + 1
    y = mylist[1] + 1.5
    return x, y
The problem is that when I apply this function to my data frame it returns the X value but it does not return the Y value.
I think it is because the Y value is not an integer.
This is the way that I do this.
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType
func = F.udf(lambda x: calculate(x), ArrayType(IntegerType()))
df.withColumn('vals', func('col')).show()
What can I do to get the Y value as well as X value? I simplified the UDF and sample data frame for the sake of being easy to understand and solve.
With the given input, the calculate UDF returns an integer and a float, which does not match the declared return type ArrayType(IntegerType()).
If both values need to be the same type, you can keep the same code and change calculate so that it returns two integers.
If, as in your use case, the first value is an integer and the second a float, you can return a StructType instead:
import pyspark.sql.functions as F
import pyspark.sql.types as T
func = F.udf(lambda x: calculate(x), T.StructType(
    [T.StructField("val1", T.IntegerType(), True),
     T.StructField("val2", T.FloatType(), True)]))
I want to append HoG feature vectors to an empty matrix of unknown dimension. Is it required to specify the dimension of the matrix in advance? I have tried some code in python but it says all the input arrays must have same dimension.
import matplotlib.pyplot as plt
from skimage.feature import hog
from skimage import data, exposure, img_as_float
from skimage import data
import numpy as np
from scipy import linalg
import cv2
import glob
shape = (16576, 1)
X = np.empty(shape)
print X.shape
hog_image = np.empty(shape)
hog_image_rescaled = np.empty(shape)
for img in glob.glob("/home/madhuri/pythoncode/faces/*.jpg"):
    n= cv2.imread(img)
    gray = cv2.cvtColor(n, cv2.COLOR_RGB2GRAY)
    hog_image = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                    cells_per_block=(3, 3), visualise=False)
    hog_image_rescaled = exposure.rescale_intensity(hog_image,
                                                    in_range=(0,10))
    X = np.append(X, hog_image_rescaled, axis=1)
print 'X is'
print np.shape(X)
X = [] # use an 'empty' list
# hog_image = np.empty(shape) # no point initializing these variables
# hog_image_rescaled = np.empty(shape) # you just reassign them in the loop
for img in glob.glob("/home/madhuri/pythoncode/faces/*.jpg"):
    n= cv2.imread(img)
    gray = cv2.cvtColor(n, cv2.COLOR_RGB2GRAY)
    hog_image = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                    cells_per_block=(3, 3), visualise=False)
    hog_image_rescaled = exposure.rescale_intensity(hog_image,
                                                    in_range=(0,10))
    X.append(hog_image_rescaled)
Now X will be a list of rescaled images. Those elements can now be concatenated on whichever dimension is appropriate:
np.concatenate(X, axis=1)
np.stack(X)
# etc
The list model of
alist = []
for ....
    alist.append(...)
does not translate well to arrays. np.append is a cover for np.concatenate, and makes a new array, which is more expensive than list append. And defining a good starting 'empty' array for such a loop is tricky. np.empty is not appropriate:
In [977]: np.empty((2,3))
Out[977]:
array([[1.48e-323, 1.24e-322, 1.33e-322],
[1.33e-322, 1.38e-322, 1.38e-322]])
In [978]: np.append(_, np.zeros((2,1)), axis=1)
Out[978]:
array([[1.48e-323, 1.24e-322, 1.33e-322, 0.00e+000],
[1.33e-322, 1.38e-322, 1.38e-322, 0.00e+000]])
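To make the list-then-combine pattern concrete, here is a small sketch. The vectors are synthetic stand-ins for the HoG output (the length 6 and the count 4 are arbitrary assumptions), so it runs without any image files.

```python
import numpy as np

features = []                      # plain Python list, cheap to append to
for i in range(4):
    vec = np.full(6, float(i))     # stand-in for one 1-D feature vector
    features.append(vec)

# Combine once, after the loop, instead of calling np.append on every iteration
X_rows = np.stack(features)            # one vector per row    -> shape (4, 6)
X_cols = np.stack(features, axis=1)    # one vector per column -> shape (6, 4)
print(X_rows.shape, X_cols.shape)
```

np.stack requires every collected array to have the same shape, which is also the condition that the "all the input arrays must have same dimension" error is complaining about.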
I have built a wrapper around a numpy array. For simplicity, I will show only the part needed to reproduce the error:
class Matrix(object):
    """wrap around numpy array
    """
    def __init__(self, shape, fill_value):
        self.matrix = np.full(shape, fill_value)
    def __getitem__(self, a, b):
        return self.matrix[a, b]
m = Matrix((10, 10), 5)
print(m[5, 5])
the print statement generates the following error:
KeyError: __getitem__() takes exactly 3 arguments (2 given)
What's the fix to access m using the [] operator, like the following:
m[1, 1]
Currently, you have a class Matrix with an attribute matrix which is a numpy array. Therefore you would need to reference the attribute first and then pass the indices:
>>> m.matrix[5,5]
5
At this point, you have not wrapped around a numpy array. Depending on what you want to do, this could be a step in the right direction:
class Matrix(np.ndarray):
    def __new__(cls, shape, fill_value=0):
        return np.full(shape, fill_value)
>>> m = Matrix((10, 10), 5)
>>> print(m[5, 5])
5
However, this essentially does nothing more than m = np.full(shape, fill_value). I suppose you are going to want to add custom attributes and methods to a numpy array, in which case you should check out this example in the numpy documentation.
The solution is to accept the index as a single argument; m[5, 5] passes the tuple (5, 5) to __getitem__:
class Matrix(object):
    """wrap around numpy array
    """
    def __init__(self, shape, fill_value):
        self.matrix = np.full(shape, fill_value)
    def __getitem__(self, a):
        # we could also do return self.matrix[a[0], a[1]]
        return self.matrix[a]
m = Matrix((10, 10), 5)
print(m[5, 5])
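Because the whole index arrives as one object, the same idea extends to assignment and slicing. The sketch below adds a __setitem__ method, which is not in the original post and is included only as an illustration.

```python
import numpy as np

class Matrix(object):
    """Thin wrapper that forwards indexing to the wrapped numpy array."""

    def __init__(self, shape, fill_value):
        self.matrix = np.full(shape, fill_value)

    def __getitem__(self, key):
        # key is an int, a slice, or a tuple such as (5, 5)
        return self.matrix[key]

    def __setitem__(self, key, value):
        # hypothetical addition: forward assignment the same way
        self.matrix[key] = value

m = Matrix((10, 10), 5)
m[1, 1] = 7
print(m[5, 5], m[1, 1])     # 5 7
print(m[0:2, 0:3].shape)    # (2, 3)
```

Passing the key through unchanged lets numpy handle tuples, slices and single integers, so the wrapper needs no special cases.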
I have a pandas dataframe that resembles one generated as follows.
import numpy as np
import pandas as pd
x0 = pd.DataFrame(np.random.normal(size=(10, 4)))
x1 = pd.DataFrame({'x': [1,1,2,3,2,3,4,1,2,3]})
df = pd.concat((x0, x1), axis=1)
and a function:
def fun(df, n=100):
    z = np.random.normal(size=n)
    return np.dot(df[[0,1,2,3]], [0.5*z,-1*z,0.3*z,1.2*z])
I would like to:
use identical draws z for each unique value in x,
take the product of the output in the above step over items of unique x
Any suggestion?
Explanation:
1. Generate n=100 draws to get z such that len(z)=100.
2. For each element in z, evaluate the function fun.
3. For i in df.x.unique(), compute the element-wise product of the output in step (2). I am expecting to get a DataFrame or array of dimension (len(df.x.unique()), n=100).
It sounds like you want to group by 'x', taking one of its instances (let's assume we take the first one observed).
Just call your function as follows:
f = fun(df.groupby('x').first())
>>> f.shape
Out[25]: (4, 100)
>>> len(df.x.unique())
Out[26]: 4
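If the intent is instead to multiply together the outputs of all rows that share a value of x (one reading of the question, not something the answer above does), a sketch could evaluate the function once with a shared z and then take a grouped product. Here fun is rewritten to accept z as an argument, and the names coeffs, per_row and per_group are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x0 = pd.DataFrame(rng.normal(size=(10, 4)))
x1 = pd.DataFrame({'x': [1, 1, 2, 3, 2, 3, 4, 1, 2, 3]})
df = pd.concat((x0, x1), axis=1)

def fun(df, z):
    # same linear combination as in the question, but z is supplied by the caller
    coeffs = np.array([0.5, -1.0, 0.3, 1.2])
    return df[[0, 1, 2, 3]].to_numpy() @ np.outer(coeffs, z)

z = rng.normal(size=100)                              # one set of draws for all rows
per_row = pd.DataFrame(fun(df, z), index=df.index)    # shape (10, 100)
per_group = per_row.groupby(df['x']).prod()           # product over rows with equal x
print(per_group.shape)                                # (4, 100)
```

Each row of per_group corresponds to one unique x, and each column to one draw in z.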
I am getting an error and I'm not sure how to fix it.
The following seems to work:
def random(row):
    return [1,2,3,4]
df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df.apply(func = random, axis = 1)
and my output is:
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
However, when I change one of the columns to a value such as 1 or None:
def random(row):
    return [1,2,3,4]
df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df['E'] = 1
df.apply(func = random, axis = 1)
I get the error:
ValueError: Shape of passed values is (5,), indices imply (5, 5)
I've been wrestling with this for a few days now and nothing seems to work. What is interesting is that when I change
def random(row):
    return [1,2,3,4]
to
def random(row):
    print [1,2,3,4]
everything seems to work normally.
This question is a clearer way of asking this question, which I feel may have been confusing.
My goal is to compute a list for each row and then create a column out of that.
EDIT: I originally start with a dataframe that has one column. I add 4 columns in 4 different apply steps, and then when I try to add another column I get this error.
If your goal is to add a new column to the DataFrame, just write your function so that it returns a scalar value (not a list), something like this:
>>> def random(row):
...     return row.mean()
and then use apply:
>>> df['new'] = df.apply(func = random, axis = 1)
>>> df
          A         B         C         D       new
0  0.201143 -2.345828 -2.186106 -0.784721 -1.278878
1 -0.198460  0.544879  0.554407 -0.161357  0.184867
2  0.269807  1.132344  0.120303 -0.116843  0.351403
3 -1.131396  1.278477  1.567599  0.483912  0.549648
4  0.288147  0.382764 -0.840972  0.838950  0.167222
I don't know if it is possible for your new column to contain lists, but it is definitely possible for it to contain tuples ((...) instead of [...]):
>>> def random(row):
...     return (1,2,3,4,5)
...
>>> df['new'] = df.apply(func = random, axis = 1)
>>> df
          A         B         C         D              new
0  0.201143 -2.345828 -2.186106 -0.784721  (1, 2, 3, 4, 5)
1 -0.198460  0.544879  0.554407 -0.161357  (1, 2, 3, 4, 5)
2  0.269807  1.132344  0.120303 -0.116843  (1, 2, 3, 4, 5)
3 -1.131396  1.278477  1.567599  0.483912  (1, 2, 3, 4, 5)
4  0.288147  0.382764 -0.840972  0.838950  (1, 2, 3, 4, 5)
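If a column of lists is really what is wanted, one way that sidesteps how apply interprets list results is to build the lists in a plain loop and wrap them in a Series aligned to the frame's index. This is a sketch rather than the only option, and how apply itself treats list return values has varied between pandas versions.

```python
import numpy as np
import pandas as pd

def random(row):
    # placeholder computation from the question: ignores the row entirely
    return [1, 2, 3, 4]

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df['E'] = 1

# Build one list per row, then attach them as an object-dtype column
lists = [random(row) for _, row in df.iterrows()]
df['new'] = pd.Series(lists, index=df.index)
print(df['new'].iloc[0])   # [1, 2, 3, 4]
```

Wrapping the lists in a Series makes it explicit that each list is a single cell value rather than a row of values to spread across columns.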
I use the code below and it works fine:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array(your_data), columns=columns)