Avoiding pandas chained selection - python-2.7

I'm trying to determine "best practice" to do the following without incurring a SettingWithCopyWarning. I'm using Python 2.7 and pandas 0.15.2.
What I want to do is subselect a dataframe and then use this selection as a new dataframe, without risking modification to the original. Here's an example of what I'm doing:
import pandas as pd
def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df[df['color'] == 'blue']
cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
The above generates a SettingWithCopyWarning in current pandas but otherwise behaves as I want it to (i.e. the cars df has not been modified).
What is the best way to implement select_blue_cars so that the subsequent code doesn't trigger this warning?
Should I be using .copy() everywhere?
return df[df['color'] == 'blue'].copy()
(Aside) What's the performance of copy() like?
Eventually I'd like to chain simple transform functions like select_blue_cars:
blue_fords = select_fords(select_blue_cars(cars))
Edit: Having thought about this a bit more, I think I'm looking for a single transform which selects a copy from the dataframe without explicitly calling .copy(). That way I can write functions that do small transformations on the df and chain them.
Transposition, for example: df.T gives a new dataframe, so there's no need to call .copy().
df2 = df.T
df2 = df.T.copy() # no need
It looks like, in the case of selection, .copy() is required for this pattern.

How you get around the SettingWithCopyWarning depends a bit on how long you plan on keeping the subset around. If you just want to briefly look at the price within a particular colour and then return to the overall dataframe, the suggestions JohnE has given are pretty good. If you actually want to keep the subset around and perform a bunch of separate analyses on it, then what I usually do is subset with .loc and explicitly copy, e.g.:
subset = df.loc[df['condition'] > 5, :].copy()
In your code, this would be:
import pandas as pd
def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df.loc[df['color'] == 'blue', :].copy()
cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
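Written this way, the selectors compose cleanly, which also covers the chaining you mention. A small sketch, with a hypothetical select_fords defined in the same style:
def select_fords(df):
    """Returns a new dataframe of Ford cars"""
    return df.loc[df['make'] == 'Ford', :].copy()

blue_fords = select_fords(select_blue_cars(cars))
blue_fords['price'] = 12000  # no warning, and cars is left untouched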

I think this remains one of the more confusing parts of pandas. You are actually asking 2 or 3 questions and the answers may be less simple than you'd think. Consequently, I'll make the simplifying assumption that you'll just keep everything in one dataset (if not, it's not that big a deal though), and give a simple answer.
What you want to do (in pseudocode):
price = 10000 if color == blue
The simplest way to do this is actually with NumPy's where():
import numpy as np

cars['price'] = np.where(cars['color'] == 'blue', 10000, np.nan)
  color  make  price
0  blue  Ford  10000
1  blue   BMW  10000
2   red  Ford    NaN
You can also nest where() calls, so it's a very powerful and simple method for conditional setting like this. You can also use ix/loc/iloc (though you need to create an empty column for 'price' first):
cars.ix[cars.color == 'blue', 'price'] = 10000
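For instance, a nested where() could look like this (a sketch; the 8000 price for red cars is made up purely for illustration):
cars['price'] = np.where(cars['color'] == 'blue', 10000,
                         np.where(cars['color'] == 'red', 8000, np.nan))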
And to briefly address the chained indexing warning, what it's mostly saying is don't try to do too much on the left hand side when setting values:
df[ df.y > 5 ]['x'] = df['z']
this is OK though:
df['x'] = df[ df.y > 5 ]['z']
Because the result of chained indexing may be a copy rather than a reference, which will cause the former to fail but not the latter. You can also get around this by using ix/loc/iloc.
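Concretely, the safe way to express the failing example (assuming 'x' already exists as a column) is to do the selection and the assignment in a single .loc call, which lets pandas align the right-hand side by index; a sketch:
df.loc[df.y > 5, 'x'] = df['z']                 # set 'x' only where y > 5
df.loc[df.y > 5, 'x'] = df.loc[df.y > 5, 'z']   # equivalent, more explicit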

Related

'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION' in Numba

I haven't seen this specific scenario in my research for this error in Numba. This is my first time using the package so it might be something obvious.
I have a function that calculates engineered features in a data set by adding, multiplying and/or dividing each column in a dataframe called data and I wanted to test whether numba would speed it up
from itertools import combinations

import numpy as np
import pandas as pd
from numba import jit

@jit
def engineer_features(engineer_type, features, data):
    # choose which features to engineer (must be > 1)
    engineered = features
    if len(engineered) > 1:
        if 'Square' in engineer_type:
            sq = data[features].apply(np.square)
            sq.columns = map(lambda s: s + '_^2', features)
        for c1, c2 in combinations(engineered, 2):
            if 'Add' in engineer_type:
                data['{0}+{1}'.format(c1, c2)] = data[c1] + data[c2]
            if 'Multiply' in engineer_type:
                data['{0}*{1}'.format(c1, c2)] = data[c1] * data[c2]
            if 'Divide' in engineer_type:
                data['{0}/{1}'.format(c1, c2)] = data[c1] / data[c2]
        if 'Square' in engineer_type and len(sq) > 0:
            data = pd.merge(data, sq, left_index=True, right_index=True)
    return data
When I call it with lists of features, engineer_type and the dataset:
engineer_type = ['Square','Add','Multiply','Divide']
df = engineer_features(engineer_type,features,joined)
I get the error: Failed at object (analyzing bytecode)
'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'
Same question here. I think the problem might be the lambda function since numba does not support function creation.
I had this same error. Numba doesn't support pandas. I converted the important columns from my pandas df into a bunch of arrays and it worked successfully under @jit.
Also, arrays are much faster than a pandas df, in case you need that for processing large data.
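As a rough illustration of that arrays-first approach (the column names and the operation are made up for the example), pull the columns out as plain NumPy arrays and jit a loop over them:
import numpy as np
from numba import jit

@jit(nopython=True)
def add_columns(a, b):
    # element-wise sum over plain NumPy arrays, which Numba compiles happily
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out

# hypothetical feature columns pulled out of the dataframe as arrays
a = data['feat1'].values
b = data['feat2'].values
data['feat1+feat2'] = add_columns(a, b)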

Reference a list of dicts

Python 2.7 on Mint Cinnamon 17.3.
I have a bit of test code employing a list of dicts and despite many hours of frustration, I cannot seem to work out why it is not working as it should do.
blockagedict = {'location': None, 'timestamp': None, 'blocked': None}
blockedlist = [blockagedict]
blockagedict['location'] = 'A'
blockagedict['timestamp'] = '12-Apr-2016 01:01:08.702149'
blockagedict['blocked'] = True
blockagedict['location'] = 'B'
blockagedict['timestamp'] = '12-Apr-2016 01:01:09.312459'
blockagedict['blocked'] = False
blockedlist.append(blockagedict)
for test in blockedlist:
    print test['location'], test['timestamp'], test['blocked']
This always produces the following output and I cannot work out why; I cannot see anything wrong with my code. It always prints out the last set of dict values, but it should print all of them, if I am not mistaken.
B 12-Apr-2016 01:01:09.312459 False
B 12-Apr-2016 01:01:09.312459 False
I would be happy for someone to show me the error of my ways and put me out of my misery.
It is because the line blockedlist = [blockagedict] actually stores a reference to the dict, not a copy, in the list. Your code effectively creates a list that has two references to the very same object.
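A minimal fix is to build a fresh dict for each record rather than mutating and re-appending the same one:
blockedlist = []

blockagedict = {'location': 'A', 'timestamp': '12-Apr-2016 01:01:08.702149', 'blocked': True}
blockedlist.append(blockagedict)

blockagedict = {'location': 'B', 'timestamp': '12-Apr-2016 01:01:09.312459', 'blocked': False}
blockedlist.append(blockagedict)

for test in blockedlist:
    print test['location'], test['timestamp'], test['blocked']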
If you care about performance and will have 1 million dictionaries in a list, all with the same keys, you will be better off using a NumPy structured array. Then you can have a single, efficient data structure which is basically a matrix of rows and named columns of appropriate types. You mentioned in a comment that you may know the number of rows in advance. Here's a rewrite of your example code using NumPy instead, which will be massively more efficient than a list of a million dicts.
import numpy as np
dtype = [('location', str, 1), ('timestamp', str, 27), ('blocked', bool)]
count = 2 # will be much larger in the real program
blockages = np.empty(count, dtype) # use zeros() instead if some data may never be populated
blockages[0]['location'] = 'A'
blockages[0]['timestamp'] = '12-Apr-2016 01:01:08.702149'
blockages[0]['blocked'] = True
blockages['location'][1] = 'B' # n.b. indexing works this way too
blockages['timestamp'][1] = '12-Apr-2016 01:01:09.312459'
blockages['blocked'][1] = False
for test in blockages:
    print test['location'], test['timestamp'], test['blocked']
Note that the usage is almost identical. But the storage is in a fixed size, single allocation. This will reduce memory usage and compute time.
As a nice side effect, writing it as above completely sidesteps the issue you originally had, with multiple references to the same row. Now all the data is placed directly into the matrix with no object references at all.
Later in a comment you mention you cannot use NumPy because it may not be installed. Well, we can still avoid unnecessary dicts, like this:
from array import array
blockages = {'location': [], 'timestamp': [], 'blocked': array('B')}
blockages['location'].append('A')
blockages['timestamp'].append('12-Apr-2016 01:01:08.702149')
blockages['blocked'].append(True)
blockages['location'].append('B')
blockages['timestamp'].append('12-Apr-2016 01:01:09.312459')
blockages['blocked'].append(False)
for location, timestamp, blocked in zip(*blockages.values()):
    print location, timestamp, blocked
Note I use array here for efficient storage of the fixed-size blocked values (this way each value takes exactly one byte).
You still end up with resizable lists that you could avoid, but at least you don't need to store a dict in every slot of the list. This should still be more efficient.
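If you do know the row count up front, as mentioned above, you could even pre-size those lists and fill by index instead of appending; a rough sketch:
from array import array

count = 2  # known in advance; much larger in the real program
blockages = {'location': [None] * count,
             'timestamp': [None] * count,
             'blocked': array('B', [0] * count)}

blockages['location'][0] = 'A'
blockages['timestamp'][0] = '12-Apr-2016 01:01:08.702149'
blockages['blocked'][0] = True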
OK, I have initialised the list of dicts right off the bat and this seems to work, although I am tempted to write a class for this.
blockedlist = [{'location': None, 'timestamp': None, 'blocked': None} for k in range(2)]
blockedlist[0]['location'] = 'A'
blockedlist[0]['timestamp'] = '12-Apr-2016 01:01:08.702149'
blockedlist[0]['blocked'] = True
blockedlist[1]['location'] = 'B'
blockedlist[1]['timestamp'] = '12-Apr-2016 01:01:09.312459'
blockedlist[1]['blocked'] = False
for test in blockedlist:
    print test['location'], test['timestamp'], test['blocked']
And this produces what I was looking for:
A 12-Apr-2016 01:01:08.702149 True
B 12-Apr-2016 01:01:09.312459 False
I will be reading from a text file with 1 to 2 million lines, so converting the code to iterate through the lines won't be a problem.
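If you do end up writing the class you're tempted by, a minimal sketch (the class and attribute names here are just an assumption) might look like:
class Blockage(object):
    def __init__(self, location, timestamp, blocked):
        self.location = location
        self.timestamp = timestamp
        self.blocked = blocked

blockedlist = [Blockage('A', '12-Apr-2016 01:01:08.702149', True),
               Blockage('B', '12-Apr-2016 01:01:09.312459', False)]

for test in blockedlist:
    print test.location, test.timestamp, test.blocked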

Datetime in python - speed of calculations - big data

I want to find the difference (in days) between two columns in a dataframe (more specifically, in the GraphLab SFrame data structure).
I have tried to write a couple of functions to do this but I cannot seem to create a function that is fast enough. Speed is my issue right now as I have ~80 million rows to process.
I have tried two different functions but both are too slow:
The t2_colname_str and t1_colname_str arguments are the column-names of which I want to use, and both columns contain datetime.datetime objects.
For Loop
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = []
    for i in range(len(sframe_obj[t2_colname_str])):
        t2 = sframe_obj[t2_colname_str][i]
        t1 = sframe_obj[t1_colname_str][i]
        try:
            diff = t2 - t1
            diff_days = diff.days
            diff_days_list.append(diff_days)
        except TypeError:
            diff_days_list.append(None)
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
List Comprehension
I know this is not the intended purpose of list comprehensions, but I just tried it to see if it was faster.
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = [(sframe_obj[t2_colname_str][i] - sframe_obj[t1_colname_str][i]).days
                      if sframe_obj[t2_colname_str][i] is not None and sframe_obj[t1_colname_str][i] is not None
                      else None
                      for i in range(len(sframe_obj[t2_colname_str]))]
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
Additional Notes
I have been using GraphLab-Create by Dato and their SFrame data-structure mainly because it parallelizes all the computation which makes my analysis super-fast and it has a great library for machine learning applications. It's a great product if you haven't checked it out already.
GraphLab User Guide can be found here: https://dato.com/learn/userguide/index.html
I'm glad you found an approach that works for you; however, SArrays allow vector operations, so you don't need to loop through every element of the column. SArrays will iterate, but they're REALLY slow at that.
Unfortunately, SArrays don't support vector operations on datetime types because they don't support a "timedelta" type. You can do this though:
diff = sframe_obj[t2_colname].astype(int) - sframe_obj[t1_colname].astype(int)
That will convert the columns to a UNIX timestamp and then do a vectorized difference operation, which should be plenty fast...at least faster than a conversion to NumPy.
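Folded back into your helper, that might look like the sketch below (assuming both columns are populated; astype(int) gives UNIX timestamps in seconds, so dividing by 86400 converts to days):
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    seconds = sframe_obj[t2_colname_str].astype(int) - sframe_obj[t1_colname_str].astype(int)
    sframe_obj[new_colname] = seconds / 86400.0  # elapsed days (fractional)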

how to apply cell style when using `append` in openpyxl?

I am using openpyxl to create an Excel worksheet. I want to apply styles when I insert the data. The trouble is that the append method takes a list of data and automatically inserts them to cells. I cannot seem to specify a font to apply to this operation.
I can go back and apply a style to individual cells after-the-fact, but this requires overhead to find out how many data points were in the list, and which row I am currently appending to. Is there an easier way?
This illustrative code shows what I would like to do:
def create_xlsx(self, header):
    self.ft_base = Font(name='Calibri', size=10)
    self.ft_bold = self.ft_base.copy(bold=True)
    if header:
        self.ws.append(header, font=self.ft_bold)  # cannot apply style during append
ws.append() is designed for appending rows of data easily. It does, however, also allow you to include placeless cells within a row so that you can apply formatting while adding data. This is primarily of interest when using write_only=True but will work for normal workbooks.
Your code would look something like:
from openpyxl.cell import Cell
from openpyxl.styles import Font

data = [1, 3, 4, 9, 10]

def styled_cells(data):
    for c in data:
        if c == 1:
            c = Cell(ws, column="A", row=1, value=c)
            c.font = Font(bold=True)
        yield c

ws.append(styled_cells(data))
openpyxl will correct the coordinates of such cells.
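Alternatively, if you would rather keep append() calls simple, one option for a normal (non write-only) worksheet is to append the plain values and then style the row that was just written; a small sketch:
ws.append(header)
row = ws.max_row  # the row append() just created
for col in range(1, len(header) + 1):
    ws.cell(row=row, column=col).font = Font(bold=True)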

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing States. I came across the pyzipcode module which can take as an input a zip code and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
    ...: zcdb = ZipCodeDatabase()
    ...: zipcode = zcdb[54115]
    ...: zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
There's no need to iterate. If the data is in the form of a dict, then you should be able to do the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise by passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x].state, axis=1)
By using double square brackets we return a df, which allows you to pass the axis param.
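Since only some of the rows are missing Physician_Profile_State, you may also want to restrict the assignment to those rows so that existing values are preserved; a sketch:
missing = df['Physician_Profile_State'].isnull()
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
)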