I have got a very simple function to apply to each row of my dataframe:
import numpy as np

def distance_ot(fromwp, towp, pl, plee):
    # Same three-character prefix: plain Manhattan distance between the two waypoints
    if fromwp[0:3] == towp[0:3]:
        sxcord = pl.loc[fromwp, "XCORD"]
        sycord = pl.loc[fromwp, "YCORD"]
        excord = pl.loc[towp, "XCORD"]
        eycord = pl.loc[towp, "YCORD"]
        x = np.abs(excord - sxcord)
        y = np.abs(eycord - sycord)
        distance = x + y
        return distance
    # Different prefixes: route via the exit point of the source prefix and the entry point of the target prefix
    else:
        x1 = np.abs(plee.loc[fromwp[0:3], "exitx"] - pl.loc[fromwp, "XCORD"])
        y1 = np.abs(plee.loc[fromwp[0:3], "exity"] - pl.loc[fromwp, "YCORD"])
        x2 = np.abs(plee.loc[fromwp[0:3], "exitx"] - plee.loc[towp[0:3], "entryx"])
        y2 = np.abs(plee.loc[fromwp[0:3], "exity"] - plee.loc[towp[0:3], "entryy"])
        x3 = np.abs(plee.loc[towp[0:3], "entryx"] - pl.loc[towp, "XCORD"])
        y3 = np.abs(plee.loc[towp[0:3], "entryy"] - pl.loc[towp, "YCORD"])
        distance = x1 + x2 + x3 + y1 + y2 + y3
        return distance
It is called with this line:
pot["traveldistance"] = pot.apply(
    lambda row: distance_ot(fromwp=row["from_wpadr"], towp=row["to_wpadr"], pl=pl, plee=plee),
    axis=1)
Here fromwp and towp are both strings, and XCORD and YCORD are floats. I tried using numba, but for some reason it does not improve performance. Any suggestions?
Thanks to caiohamamura's hint, here is the solution:
distance_ot(pl=pl, plee=plee)
same_prefix = pot.from_wpadr.str[0:3] == pot.to_wpadr.str[0:3]
pot.loc[same_prefix, "traveldistance"] = pot["distance1"]
pot.loc[~same_prefix, "traveldistance"] = pot["distance2"]
def distance_ot(pl, plee):
    # Look up the coordinates of all from/to waypoints at once
    from_df = pl.loc[pot["from_wpadr"]]
    to_df = pl.loc[pot["to_wpadr"]]
    sxcord = from_df["XCORD"].values
    sycord = from_df["YCORD"].values
    excord = to_df["XCORD"].values
    eycord = to_df["YCORD"].values
    # Manhattan distance for rows whose waypoints share the same prefix
    x = np.abs(excord - sxcord)
    y = np.abs(eycord - sycord)
    pot["distance1"] = x + y
    # Distance via the exit/entry points for rows with different prefixes
    from_df2 = plee.loc[pot["from_wpadr"].str[0:3]]
    to_df2 = plee.loc[pot["to_wpadr"].str[0:3]]
    x1 = np.abs(from_df2["exitx"].values - from_df["XCORD"].values)
    y1 = np.abs(from_df2["exity"].values - from_df["YCORD"].values)
    x2 = np.abs(from_df2["exitx"].values - to_df2["entryx"].values)
    y2 = np.abs(from_df2["exity"].values - to_df2["entryy"].values)
    x3 = np.abs(to_df2["entryx"].values - to_df["XCORD"].values)
    y3 = np.abs(to_df2["entryy"].values - to_df["YCORD"].values)
    pot["distance2"] = x1 + x2 + x3 + y1 + y2 + y3
Vectorize the distance_ot function to calculate all distances at once. I would begin by populating a from_df and a to_df like the following:
import numpy as np
from_df = pl.loc[np.in1d(pl.index, pot["from_wpadr"])]
to_df = pl.loc[np.in1d(pl.index, pot["to_wpadr"])]
Then you can continue as in your function:
sxcord=from_df["XCORD"]
sycord=from_df["YCORD"]
excord=to_df["XCORD"]
eycord=to_df["YCORD"]
x=np.abs(excord-sxcord); y=np.abs(eycord-sycord)
distances=x+y
This calculates all the distances at once. Your if clause can also be vectorized: the results are computed into two separate arrays, one for the rows that match the condition and one for the rows that don't. You just have to keep track of the boolean array so you can put them back together in the dataframe afterwards:
first_three_equals = np.char.ljust(pot["from_wpadr"].values.astype(str), 3) \
== np.char.ljust(pot["to_wpadr"].values.astype(str), 3)
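If it helps, here is a minimal sketch of how the two result arrays could be stitched back together with that boolean mask. The names distances_same and distances_other are hypothetical placeholders for the two vectorized results (they are not defined above):
import numpy as np

# Hypothetical: distances_same holds the plain Manhattan distances and distances_other
# the exit/entry distances, both computed for every row of pot in the same order.
pot["traveldistance"] = np.where(first_three_equals, distances_same, distances_other)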
Related
I have two dataframes, one with 4.7 million rows and the other with 1 million rows. I need to join these two dataframes based on some conditions, but using a for loop the operation takes a lot of time. How can I convert my for-loop code into an efficient pandas query?
su_rating_range = [0] * len(tb_su_name)
for x in xrange(len(tb_su_name)):
    print "tb count--", x
    for y in xrange(len(su_su_name)):
        if tb_su_name[x] == su_su_name[y] and tb_year_week[x] == su_year_week[y] and tb_tg_mkt[x] == su_tg_mkt[y]:
            print "su count--", y
            su_rating_range[x] = su_ratings[y]
tb_concate_ratings["LAG_RATING_su"] = su_rating_range
Using the merge function of pandas, you can try:
result = tb.merge(su, how='inner', on=['su_name', 'year_week', 'tg_mkt'])
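If the key columns are named differently in the two frames (for example tb_su_name versus su_su_name, which is only a guess based on the loop variables above), left_on/right_on can be used instead; a minimal sketch:
# Hypothetical column names inferred from the loop variables; adjust to the real ones.
result = tb.merge(su, how='inner',
                  left_on=['tb_su_name', 'tb_year_week', 'tb_tg_mkt'],
                  right_on=['su_su_name', 'su_year_week', 'su_tg_mkt'])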
I am working with a list of points in Python 2.7 and running some interpolations on the data. My list has over 5000 points, and it contains some repeating "x" values. These repeating "x" values have different corresponding "y" values. I want to get rid of these repeating points so that my interpolation function will work, because if there are repeating "x" values with different "y" values it raises an error, since the data no longer satisfies the criteria of a function. Here is a simple example of what I am trying to do:
Input:
x = [1,1,3,4,5]
y = [10,20,30,40,50]
Output:
xy = [(1,10),(3,30),(4,40),(5,50)]
The interpolation function I am using is InterpolatedUnivariateSpline(x, y)
Have a variable where you store the previous X value; if it is the same as the current value, skip the current value.
For example (pseudo code, you do the python),
int previousX = -1
foreach X
{
    if (x == previousX)
        { /* skip */ }
    else
    {
        InterpolatedUnivariateSpline(x, y)
        previousX = x   /* store the x value that will be "previous" in the next iteration */
    }
}
I am assuming you are already iterating, so you don't need the actual Python code.
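All the same, a minimal Python sketch of that pseudocode, assuming x and y are already sorted by x:
from scipy.interpolate import InterpolatedUnivariateSpline

# Keep only the first (x, y) pair for each distinct x value.
previous_x = None
xs, ys = [], []
for xi, yi in zip(x, y):
    if xi == previous_x:
        continue   # skip repeated x values
    xs.append(xi)
    ys.append(yi)
    previous_x = xi

spline = InterpolatedUnivariateSpline(xs, ys)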
A bit late but if anyone is interested, here's a solution with numpy and pandas:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

x = [1, 1, 3, 4, 5]
y = [10, 20, 30, 40, 50]

# convert the lists into numpy arrays:
array_x, array_y = np.array(x), np.array(y)

# sort x and y by x value
order = np.argsort(array_x)
xsort, ysort = array_x[order], array_y[order]

# create a dataframe and add 2 columns for your x and y data:
df = pd.DataFrame()
df['xsort'] = xsort
df['ysort'] = ysort

# create a new dataframe (mean) with no duplicate x values and the corresponding
# mean values in all other columns:
mean = df.groupby('xsort').mean()
df_x = mean.index
df_y = mean['ysort']

# poly1d creates a polynomial line from the coefficient inputs:
trend = np.polyfit(df_x, df_y, 14)
trendpoly = np.poly1d(trend)

# plot the polyfit line (colour and the figure name are placeholders from the original post):
plt.plot(df_x, trendpoly(df_x), linestyle=':', dashes=(6, 5), linewidth=0.8,
         color=colour, zorder=9, figure=[name of figure])
Also, if you just use argsort() on the values in order of x, the interpolation should work even without having to delete the duplicate x values. Trying it on my own dataset, I compared three variants: polyfit on its own; sorting the data in order of x first, then polyfit; and sorting the data, deleting duplicates, then polyfit. The last two give the same result.
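To tie this back to the original question, the de-duplicated, sorted values can also be fed straight into the spline the asker mentioned; a minimal sketch reusing df_x and df_y from the snippet above:
from scipy.interpolate import InterpolatedUnivariateSpline

# df_x is already sorted and unique (it is the groupby key), which is what the spline requires.
spline = InterpolatedUnivariateSpline(df_x.values, df_y.values)
smoothed = spline(df_x.values)   # evaluate the spline at the de-duplicated x positions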
I have two columns in a Pandas DataFrame that has datetime as its index. The two columns contain data measuring the same parameter, but neither column is complete (some rows have no data at all, some rows have data in both columns, and others have data only in column 'a' or 'b').
I've written the following code to find the gaps in each column, generate a list of indices of the dates where these gaps appear, and use this list to find and replace the missing data. However, I get a KeyError: Not in index on line 3, which I don't understand because the keys I'm using to index came from the DataFrame itself. Could somebody explain why this is happening and what I can do to fix it? Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df

merge_func(sve)
Whenever you are performing assignment like this, you should use .loc:
df.loc[null_index,'TOC_mg/L']=df['DOC_mg/L']
The error in your original code is the ordering of the subscript values for the index lookup:
df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
will produce an index error; on a toy dataset I get: IndexError: indices are out-of-bounds
If you changed the order to this it would probably work:
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index]
However, this is chained assignment and should be avoided; see the online docs.
So you should use loc:
df.loc[null_index,'TOC_mg/L']=df['DOC_mg/L']
df.loc[notnull_index, 'DOC_mg/L'] = df['TOC_mg/L']
Note that it is not necessary to use the same index on the RHS, as it will align correctly.
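A minimal toy sketch of the corrected function, with made-up data just to show the alignment behaviour:
import numpy as np
import pandas as pd

df = pd.DataFrame({'DOC_mg/L': [1.0, np.nan, 3.0, np.nan],
                   'TOC_mg/L': [np.nan, 2.0, np.nan, np.nan]})

null_index = df[df['DOC_mg/L'].notnull() & df['TOC_mg/L'].isnull()].index
notnull_index = df[df['DOC_mg/L'].isnull() & df['TOC_mg/L'].notnull()].index

# .loc assignment; the RHS Series aligns on the index, so no extra slicing is needed
df.loc[null_index, 'TOC_mg/L'] = df['DOC_mg/L']
df.loc[notnull_index, 'DOC_mg/L'] = df['TOC_mg/L']

df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2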
I have a function that uses two columns in a dataframe:
def create_time(var, var1):
    if var == "Helår":
        y = var1 + 'Q4'
    else:
        if var == 'Halvår':
            y = var1 + 'Q2'
        else:
            y = var1 + 'Q' + str(var)[0:1]
    return y
Now I want to loop through my dataframe, creating a new column using the function, where Var and Var1 are columns in the dataframe.
I tried the following, but had no luck:
for row in bd.iterrows():
    A = str(bd['Var'])
    B = str(bd['Var1'])
    bd['period'] = create_time(A, B)
Looping is a last resort. There is usually a "vectorized" way to operate on the entire DataFrame, which is always faster and usually more readable too.
To apply your custom function to each row, use apply with the keyword argument axis=1.
bd['period'] = bd[['Var', 'Var1']].apply(lambda x: create_time(*x), axis=1)
You might wonder why it's not just bd.apply(create_time). Since create_time wants two arguments, we have to "unpack" the row x into its two values and pass those to the function.
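As a side note, in this particular case the same logic can be written without apply at all. Here is a minimal vectorized sketch using np.select, under the assumption that Var and Var1 are string columns in bd:
import numpy as np

# Hypothetical vectorized equivalent of create_time applied to whole columns.
conditions = [bd['Var'] == 'Helår', bd['Var'] == 'Halvår']
choices = [bd['Var1'] + 'Q4', bd['Var1'] + 'Q2']
default = bd['Var1'] + 'Q' + bd['Var'].astype(str).str[0]
bd['period'] = np.select(conditions, choices, default=default)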
One may select elements in numpy arrays as follows
a = np.random.rand(100)
sel = a > 0.5   # select elements that are greater than 0.5
a[sel] = 0      # do something with the selection

b = np.array(list('abc abc abc'))
b[b == 'a'] = 'A'   # convert all the a's to A's
This property is used by the np.where function to retrieve indices:
indices = np.where(a>0.9)
What I would like to do is to be able to use regular expressions in such element selection. For example, if I want to select elements from b above that match the [Aab] regexp, I need to write the following code:
regexp = '[Ab]'
selection = np.array([bool(re.search(regexp, element)) for element in b])
This looks too verbose to me. Is there any shorter and more elegant way to do this?
There's some setup involved here, but unless numpy has some kind of direct support for regular expressions that I don't know about, this is the most "numpythonic" solution. It tries to make iteration over the array more efficient than standard Python iteration.
import numpy as np
import re
r = re.compile('[Ab]')
vmatch = np.vectorize(lambda x:bool(r.match(x)))
A = np.array(list('abc abc abc'))
sel = vmatch(A)
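For completeness, a small usage sketch on the arrays above; note that np.vectorize still calls the lambda once per element under the hood, so the gain is mainly readability rather than raw speed:
matched = A[sel]               # the elements of A that match the pattern
indices = np.nonzero(sel)[0]   # their positions, analogous to np.where in the question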