Using a function when looping through a dataframe (python/pandas) - python-2.7

I have a function that uses two columns in a dataframe:
def create_time(var, var1):
    if var == "Helår":
        y = var1 + 'Q4'
    elif var == 'Halvår':
        y = var1 + 'Q2'
    else:
        y = var1 + 'Q' + str(var)[0:1]
    return y
Now I want to loop through my dataframe, creating a new column using the function, where var and var1 are columns in the dataframe.
I tried the following, but with no luck:
for row in bd.iterrows():
    A = str(bd['Var'])
    B = str(bd['Var1'])
    bd['period'] = create_time(A, B)

Looping is a last resort. There is usually a "vectorized" way to operate on the entire DataFrame, which is almost always faster and usually more readable too.
To apply your custom function to each row, use apply with the keyword argument axis=1.
bd['period'] = bd[['Var', 'Var1']].apply(lambda x: create_time(*x), axis=1)
You might wonder why it's not just bd.apply(create_time). Since create_time wants two arguments, we have to "unpack" the row x into its two values and pass those to the function.
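For comparison, the same logic can be vectorized entirely, avoiding the per-row Python call. A minimal sketch, assuming 'Var' and 'Var1' are string columns as in the question:
# Default case: take the first character of Var as the quarter number
bd['period'] = bd['Var1'] + 'Q' + bd['Var'].astype(str).str[0]
# The two special cases overwrite the default
bd.loc[bd['Var'] == 'Helår', 'period'] = bd['Var1'] + 'Q4'
bd.loc[bd['Var'] == 'Halvår', 'period'] = bd['Var1'] + 'Q2'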

Related

Conversion of nested for loop into lambda or pandas join/merge format

I have two dataframes: one has 4.7 million rows and the other has 1 million rows. I need to join the two dataframes based on some conditions, but using a for loop the operation takes a lot of time. How can I convert my for loop code into an efficient pandas query?
su_rating_range = [0] * len(tb_su_name)
for x in xrange(len(tb_su_name)):
    print "tb count--", x
    for y in xrange(len(su_su_name)):
        if tb_su_name[x] == su_su_name[y] and tb_year_week[x] == su_year_week[y] and tb_tg_mkt[x] == su_tg_mkt[y]:
            print "su count--", y
            su_rating_range[x] = su_ratings[y]
tb_concate_ratings["LAG_RATING_su"] = su_rating_range
Using the merge function of pandas you can try:
result = tb.merge(su, how = 'inner', on = ['su_name', 'year_week', 'tg_mkt'])
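This assumes both frames already share the key columns su_name, year_week, and tg_mkt. If the keys currently live in the separate lists shown in the question, a minimal sketch (frame layout assumed, list names taken from the question) would first assemble them:
import pandas as pd

# Build key columns with matching names on both sides
tb = pd.DataFrame({'su_name': tb_su_name, 'year_week': tb_year_week,
                   'tg_mkt': tb_tg_mkt})
su = pd.DataFrame({'su_name': su_su_name, 'year_week': su_year_week,
                   'tg_mkt': su_tg_mkt, 'LAG_RATING_su': su_ratings})
# A single hash join replaces the 4.7M x 1M nested loop
result = tb.merge(su, how='inner', on=['su_name', 'year_week', 'tg_mkt'])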

Converting some elements of a list of a dataset to float

I'm reading in a csv file, and all rows contain string elements. One might be:
"orange", "2", "65", "banana"
I want to change this, within my dataset, to become:
row = ["orange", 2.0, 65.0, "banana"]
Here is my code:
data = f.read().split("\n")
for row in data:
    for x in row:
        if x.isdigit():
            x = float(x)
    print row
But it still prints the original rows like:
"orange", "2", "65", "banana"
I also want to achieve this without using list comprehensions (for now).
I believe it is because you cannot edit the row in place like that. The loop variable x is just a name bound to each element in turn; reassigning it rebinds the name to a new object without touching the list, so all the changes you make evaporate once you're done iterating.
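A quick illustration of that rebinding (a minimal sketch, not from the original answer):
row = ["1", "2"]
for x in row:
    x = float(x)  # rebinds the local name x only; the list is untouched
print row         # ['1', '2']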
I'm not sure if this is the idiomatic 'python way' of doing this but you could do:
data = f.read().split("\n")
for row in data:
    parsed_row = []
    for x in row:
        if x.isdigit():
            x = float(x)
        parsed_row.append(x)
    print parsed_row
Alternatively a more 'pythonic' way, as provided by JGreenwell in the comments, may be to allow an exception to be thrown if an element cannot be parsed to float.
data = f.read().split("\n")
for row in data:
    parsed_row = []
    for x in row:
        try:
            parsed_row.append(float(x))
        except ValueError:
            parsed_row.append(x)
    print parsed_row
It really would come down to personal preference I imagine. Python exceptions shouldn't be slow so I wouldn't be concerned about that.
Perhaps it is your delimiter. Try something like:
import csv

with open('yourfile.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in reader:
        for x in row:
            if x.isdigit():
                print(float(x))
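Putting the two answers together, a sketch that parses with csv.reader and converts the numeric fields (assuming a comma-separated file with optional spaces after the commas):
import csv

with open('yourfile.csv', 'rb') as csvfile:
    # skipinitialspace swallows the space after each comma in the sample row
    reader = csv.reader(csvfile, skipinitialspace=True)
    for row in reader:
        parsed_row = []
        for x in row:
            try:
                parsed_row.append(float(x))
            except ValueError:
                parsed_row.append(x)
        print parsed_row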

Pandas apply function taking up to 10 min (numba does not help)

I have got a very simple function to apply to each row of my dataframe:
def distance_ot(fromwp, towp, pl, plee):
    if fromwp[0:3] == towp[0:3]:
        sxcord = pl.loc[fromwp, "XCORD"]
        sycord = pl.loc[fromwp, "YCORD"]
        excord = pl.loc[towp, "XCORD"]
        eycord = pl.loc[towp, "YCORD"]
        x = np.abs(excord - sxcord)
        y = np.abs(eycord - sycord)
        distance = x + y
        return distance
    else:
        x1 = np.abs(plee.loc[fromwp[0:3], "exitx"] - pl.loc[fromwp, "XCORD"])
        y1 = np.abs(plee.loc[fromwp[0:3], "exity"] - pl.loc[fromwp, "YCORD"])
        x2 = np.abs(plee.loc[fromwp[0:3], "exitx"] - plee.loc[towp[0:3], "entryx"])
        y2 = np.abs(plee.loc[fromwp[0:3], "exity"] - plee.loc[towp[0:3], "entryy"])
        x3 = np.abs(plee.loc[towp[0:3], "entryx"] - pl.loc[towp, "XCORD"])
        y3 = np.abs(plee.loc[towp[0:3], "entryy"] - pl.loc[towp, "YCORD"])
        distance = x1 + x2 + x3 + y1 + y2 + y3
        return distance
It is called with this line:
pot["traveldistance"] = pot.apply(lambda row: distance_ot(fromwp=row["from_wpadr"], towp=row["to_wpadr"], pl=pl, plee=plee), axis=1)
Where fromwp and towp are both strings, and XCORD and YCORD are floats. I tried using numba, but for some reason it does not improve performance. Any suggestions?
Thanks to caiohamamura's hint, here is the solution:
def distance_ot(pl, plee):
    from_df = pl.loc[pot["from_wpadr"]]
    to_df = pl.loc[pot["to_wpadr"]]
    sxcord = from_df["XCORD"].values
    sycord = from_df["YCORD"].values
    excord = to_df["XCORD"].values
    eycord = to_df["YCORD"].values
    x = np.abs(excord - sxcord)
    y = np.abs(eycord - sycord)
    pot["distance1"] = x + y
    from_df2 = plee.loc[pot["from_wpadr"].str[0:3]]
    to_df2 = plee.loc[pot["to_wpadr"].str[0:3]]
    x1 = np.abs(from_df2["exitx"].values - from_df["XCORD"].values)
    y1 = np.abs(from_df2["exity"].values - from_df["YCORD"].values)
    x2 = np.abs(from_df2["exitx"].values - to_df2["entryx"].values)
    y2 = np.abs(from_df2["exity"].values - to_df2["entryy"].values)
    x3 = np.abs(to_df2["entryx"].values - to_df["XCORD"].values)
    y3 = np.abs(to_df2["entryy"].values - to_df["YCORD"].values)
    pot["distance2"] = x1 + x2 + x3 + y1 + y2 + y3

distance_ot(pl=pl, plee=plee)
pot.ix[pot.from_wpadr.str[0:3] == pot.to_wpadr.str[0:3], "traveldistance"] = pot["distance1"]
pot.ix[pot.from_wpadr.str[0:3] != pot.to_wpadr.str[0:3], "traveldistance"] = pot["distance2"]
Vectorize the distance_ot function to calculate all distances at once. I would begin by populating a from_df and a to_df like the following:
import numpy as np

from_df = pl.loc[np.in1d(pl.index, pot["from_wpadr"])]
to_df = pl.loc[np.in1d(pl.index, pot["to_wpadr"])]
Then you can continue as in your function:
sxcord = from_df["XCORD"]
sycord = from_df["YCORD"]
excord = to_df["XCORD"]
eycord = to_df["YCORD"]
x = np.abs(excord - sxcord)
y = np.abs(eycord - sycord)
distances = x + y
This will calculate all the distances at once. Your if clause can also be vectorized: compute the results for the rows that satisfy the condition and for the rows that don't as two separate arrays, and keep track of the boolean mask so you can put them back together in the dataframe afterwards:
first_three_equals = (np.char.ljust(pot["from_wpadr"].values.astype(str), 3)
                      == np.char.ljust(pot["to_wpadr"].values.astype(str), 3))
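To stitch the two result arrays back together with that mask, something like the following works (a sketch; distance_same and distance_cross are assumed names for the arrays produced by the two branches):
import numpy as np

# Per row: take the same-area result where the mask is True,
# otherwise the cross-area result
pot["traveldistance"] = np.where(first_three_equals, distance_same, distance_cross)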

Deleting duplicate x values and their corresponding y values

I am working with a list of points in Python 2.7 and running some interpolations on the data. My list has over 5000 points and contains repeating "x" values with different corresponding "y" values. I want to get rid of these repeating points so that my interpolation function will work, because repeated "x" values with different "y" values mean the data no longer describes a function, and the interpolation raises an error. Here is a simple example of what I am trying to do:
Input:
x = [1,1,3,4,5]
y = [10,20,30,40,50]
Output:
xy = [(1,10),(3,30),(4,40),(5,50)]
The interpolation function I am using is InterpolatedUnivariateSpline(x, y)
Have a variable where you store the previous x value; if it is the same as the current value, then skip the current value.
For example (pseudocode, you do the Python):
int previousX = -1
foreach X
{
    if (X == previousX)
    { /* skip */ }
    else
    {
        InterpolatedUnivariateSpline(x, y)
        previousX = X  /* store the x value that will be "previous" in the next iteration */
    }
}
I am assuming you are already iterating, so you don't need the actual Python code.
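For completeness, a minimal Python rendering of that idea (assuming x is sorted, and keeping the first y seen for each repeated x):
xs, ys = [], []
previous_x = None
for xi, yi in zip(x, y):
    if xi == previous_x:
        continue  # skip repeated x values
    xs.append(xi)
    ys.append(yi)
    previous_x = xi
spline = InterpolatedUnivariateSpline(xs, ys)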
A bit late but if anyone is interested, here's a solution with numpy and pandas:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

x = [1, 1, 3, 4, 5]
y = [10, 20, 30, 40, 50]
# Convert the lists into numpy arrays:
array_x, array_y = np.array(x), np.array(y)
# Sort x and y by x value:
order = np.argsort(array_x)
xsort, ysort = array_x[order], array_y[order]
# Create a dataframe and add 2 columns for your x and y data:
df = pd.DataFrame()
df['xsort'] = xsort
df['ysort'] = ysort
# Create a new dataframe (mean) with no duplicate x values and the
# corresponding mean values in all other columns:
mean = df.groupby('xsort').mean()
df_x = mean.index
df_y = mean['ysort']
# poly1d creates a polynomial line from coefficient inputs:
trend = np.polyfit(df_x, df_y, 14)
trendpoly = np.poly1d(trend)
# Plot the polyfit line:
plt.plot(df_x, trendpoly(df_x), linestyle=':', dashes=(6, 5), linewidth=0.8,
         color=colour, zorder=9, figure=[name of figure])
Also, if you just use argsort() on the values in order of x, the interpolation should work even without having to delete the duplicate x values. Trying it on my own dataset, I compared three variants:
polyfit on its own
sorting the data in order of x first, then polyfit
sorting the data, deleting duplicates, then polyfit
... and I get the same result for the last two.

Combining data from two dataframe columns into one column

I have time series data in two separate DataFrame columns which refer to the same parameter but are of differing lengths.
On dates where data only exist in one column, I'd like this value to be placed in my new column. On dates where there are entries for both columns, I'd like to have the mean value. (I'd like to join using the index, which is a datetime value)
Could somebody suggest a way that I could combine my two columns? Thanks.
Edit 2: I've written some code which should merge the data from both of my columns, but I get a KeyError when I try to set the new values using the index generated from rows where my first column has values and my second doesn't. Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df

merge_func(sve)
And here's the error:
KeyError: "['2004-01-14T01:00:00.000000000+0100' '2004-03-04T01:00:00.000000000+0100' ... '2012-11-21T01:00:00.000000000+0100' '2012-12-18T01:00:00.000000000+0100'] not in index"
You are close, but you actually don't need to iterate over the rows when using the isnull() functions.
df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
will return just the index of the rows where DOC_mg/L is not null and TOC_mg/L is null.
Now you can do something like this to set the values for TOC_mg/L:
null_index = df[(df['DOC_mg/L'].isnull() == False) &
                (df['TOC_mg/L'].isnull() == True)].index
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index]  # EDIT: switched the index position.
This will use the index of the rows where TOC_mg/L is null and DOC_mg/L is not null, and set the values of TOC_mg/L to those found in DOC_mg/L in the same rows.
Note: This is not the accepted way for setting values using an index, but it is how I've been doing it for some time. Just make sure that when setting values, the left side of the equation is df['col_name'][index]. If col_name and index are switched you will set the values to a copy which is never set back to the original.
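For reference, the same assignment with label-based .loc indexing, which has since become the recommended form because it avoids chained indexing (a sketch, not part of the original answer):
# A single .loc call writes to the original frame rather than a possible copy
df.loc[null_index, 'TOC_mg/L'] = df.loc[null_index, 'DOC_mg/L']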
Now, to set the mean, you can create a new column, which we'll call Mean_mg/L, with a default value of 0.0, and then set this new column to the mean of both columns:
# Insert a new col at the end of the dataframe columns name 'Mean_mg/L'
# with default value 0.0
df.insert(len(df.columns), 'Mean_mg/L', 0.0)
# Set this columns value to the average of DOC_mg/L and TOC_mg/L
df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
In the rows where we filled a null value with the other column's value, the mean will simply equal that value.
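As an aside, pandas can also compute the row-wise mean directly, skipping NaNs, which covers both the one-value and two-value rows in one line (a sketch, not part of the original answer):
# mean(axis=1) ignores NaN by default: rows with a single measurement get
# that measurement back, rows with both get the average
df['Mean_mg/L'] = df[['DOC_mg/L', 'TOC_mg/L']].mean(axis=1)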