Adding data to a Pandas dataframe - python-2.7

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing States. I came across the pyzipcode module which can take as an input a zip code and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.

There is no need to iterate. If the lookup data is in the form of a dict (zip code to state), you should be able to perform the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
If the above does not work because it cannot generate a Series that aligns with your df, you can apply row-wise by passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda row: zcdb[row['Physician_Profile_Zip_Code']].state, axis=1)
Using double square brackets returns a df rather than a Series, which allows you to pass the axis param. Each row then arrives in the lambda as a Series, so the zip code is looked up by column name.
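Since the question asks to fill in the state only where it is missing, here is a minimal sketch of that idea. The zip_to_state dict and the sample rows are made up for illustration and stand in for the pyzipcode lookup; with the real database you would build the replacement values from zcdb instead.

```python
import pandas as pd

# Hypothetical lookup table standing in for pyzipcode's ZipCodeDatabase.
zip_to_state = {54115: 'WI', 10001: 'NY'}

# Made-up sample data: two of the three states are missing.
df = pd.DataFrame({
    'Physician_Profile_Zip_Code': [54115, 10001, 54115],
    'Physician_Profile_State': ['WI', None, None],
})

# Boolean mask of the rows whose state is missing.
missing = df['Physician_Profile_State'].isnull()

# Look up the state only for those rows and write it back in one step.
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].map(zip_to_state)
)

print(df)
```

Rows that already have a state are left untouched, and zip codes absent from the lookup simply stay missing.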

Related

Any ideas on Iterating over dataframe and applying regex?

This may be a rudimentary problem but I am new to pandas.
I have a dataframe loaded from a CSV file and I want to iterate over each row, using regex to extract all the string information in a specific column. (I am using regex because I eventually want to build a separate dataframe from that column.)
I tried iterating with a for loop but got a ton of errors. So far, it looks like the for loop reads each row as a list or Series rather than a string (correct me if I'm wrong). The main functions I am using are iteritems() and findall(), but with no good results so far. How can I approach this problem?
My dataframe looks like this:
df = pd.read_csv('foobar.csv')
df[['column1', 'column2', 'TEXT']]
My approach looks like this:
for Individual_row in df['TEXT'].iteritems():
    parsed = re.findall('(.*?)\:\s*?\[(.*?)\]', Individual_row)
    res = {g[0].strip() : g[1].strip() for g in parsed}
Many thanks in advance
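You can try the following instead of a loop:
df['new_TEXT'] = df['TEXT'].map(lambda x: [[g[0].strip(), g[1].strip()] for g in re.findall(r'(.*?):\s*?\[(.*?)\]', x)], na_action='ignore')
This creates a new column with your resultant data. Note that na_action='ignore' is a parameter of Series.map (not Series.apply); it leaves missing values untouched instead of passing them to the regex.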
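To make this concrete, here is a small runnable sketch. The sample text and the parse_text helper are assumptions for illustration; the question does not show what the TEXT column actually contains.

```python
import re

import pandas as pd

# Same pattern as in the question: "key: [value]" pairs.
PATTERN = re.compile(r'(.*?):\s*?\[(.*?)\]')


def parse_text(text):
    """Return {key: value} for every 'key: [value]' pair found in text."""
    return {key.strip(): value.strip() for key, value in PATTERN.findall(text)}


# Made-up sample data, including one missing value.
df = pd.DataFrame({'TEXT': ['name: [Alice] age: [30]', None]})

# Series.map hands each cell to parse_text as a plain string;
# na_action='ignore' skips missing cells.
df['parsed'] = df['TEXT'].map(parse_text, na_action='ignore')

print(df['parsed'].iloc[0])
```

The original loop failed because iteritems() yields (index, value) tuples, so findall() never received a plain string.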

How do I compare two columns at once against two different data frames in python (pandas)?

df1 contains two columns of Lat and Long, and several thousand rows. df2 also contains two columns of lat and long with many rows. Essentially, df2 is a list of reference locations that I want to compare df1 with. I want to compare both the Latitude and Longitude of df1 with df2 to say their locations match, or say they don't. i.e.,
my_data = pd.read_csv('/path/to/file', usecols = ['Lat','Lon'])
reference_data = pd.read_csv('/path/to/file', usecols = ['Lat','Lon'])
In simpler words, I want to say that if the location in each row in my_data is present in reference_data, label it 1, else label it 0. Since this location has two components Lat and Long, they BOTH need to be present next to each other anywhere in the reference dataframe. Is there an easy one-liner?
You could generate this by using the merge function to join the reference_data to my_data with an indicator.
new_df = pd.merge(my_data, reference_data, on=['Lat','Lon'], how='left', indicator='flag')
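You'll get a dataframe that looks like my_data but includes a new column "flag" whose value is either "left_only" or "both". (If reference_data contains the same Lat/Lon pair more than once, the matching rows of my_data are repeated in the result, so drop duplicates from reference_data first if that matters.)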
To get it as a [0,1] label:
new_df['bin_flag'] = (new_df['flag']=='both').astype(int)
To my knowledge, there is not an actual one-liner for this one.
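A self-contained sketch of the merge approach, with made-up coordinates in place of the CSV files. The drop_duplicates step is an addition of mine to keep the row count equal to my_data; it is not part of the answer above.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files.
my_data = pd.DataFrame({'Lat': [10.0, 20.0, 30.0],
                        'Lon': [1.0, 2.0, 3.0]})
reference_data = pd.DataFrame({'Lat': [10.0, 30.0, 30.0, 20.0],
                               'Lon': [1.0, 3.0, 3.0, 9.0]})

# Remove repeated reference locations so the left merge cannot duplicate rows.
reference_unique = reference_data.drop_duplicates(subset=['Lat', 'Lon'])

# indicator='flag' adds a column saying where each row was found.
new_df = pd.merge(my_data, reference_unique, on=['Lat', 'Lon'],
                  how='left', indicator='flag')

# 1 if the (Lat, Lon) pair exists in the reference, else 0.
new_df['bin_flag'] = (new_df['flag'] == 'both').astype(int)

print(new_df)
```

The second row is labelled 0 because Lat 20.0 appears in the reference only with a different Lon. Keep in mind that this is exact floating-point equality, so coordinates that differ by rounding will not match.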
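You can also do something like:
my_data.apply(lambda x: (x['Lat'] in reference_data['Lat'].values and x['Lon'] in reference_data['Lon'].values) * 1.0, axis=1)
and then you can just assign it wherever you like. The .values is needed because "in" on a Series tests the index rather than the values. Note also that this checks Lat and Lon independently, so it can flag a row whose two coordinates each appear in the reference but in different rows.
To require that both coordinates come from the same reference row, compare pairs instead: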
my_data.apply(lambda x: ((x['Lat'], x['Lon']) in zip(reference_data['Lat'], reference_data['Lon'])) * 1.0, axis=1)
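The zip-based version rebuilds and scans the list of reference pairs for every row. A possible variation, sketched here with made-up coordinates, is to build a set of pairs once so each lookup is a hash check:

```python
import pandas as pd

# Made-up stand-ins for the two CSV files.
my_data = pd.DataFrame({'Lat': [10.0, 20.0, 30.0],
                        'Lon': [1.0, 2.0, 3.0]})
reference_data = pd.DataFrame({'Lat': [10.0, 30.0, 20.0],
                               'Lon': [1.0, 3.0, 9.0]})

# Build the set of reference (Lat, Lon) pairs a single time.
reference_pairs = set(zip(reference_data['Lat'], reference_data['Lon']))

# Label each row 1 if its pair is in the reference set, else 0.
my_data['label'] = [
    int(pair in reference_pairs)
    for pair in zip(my_data['Lat'], my_data['Lon'])
]

print(my_data)
```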

'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION' in Numba

I haven't seen this specific scenario in my research for this error in Numba. This is my first time using the package so it might be something obvious.
I have a function that calculates engineered features in a data set by adding, multiplying and/or dividing each column in a dataframe called data and I wanted to test whether numba would speed it up
@jit
def engineer_features(engineer_type,features,joined):
    #choose which features to engineer (must be > 1)
    engineered = features
    if len(engineered) > 1:
        if 'Square' in engineer_type:
            sq = data[features].apply(np.square)
            sq.columns = map(lambda s:s + '_^2',features)
        for c1,c2 in combinations(engineered,2):
            if 'Add' in engineer_type:
                data['{0}+{1}'.format(c1,c2)] = data[c1] + data[c2]
            if 'Multiply' in engineer_type:
                data['{0}*{1}'.format(c1,c2)] = data[c1] * data[c2]
            if 'Divide' in engineer_type:
                data['{0}/{1}'.format(c1,c2)] = data[c1] / data[c2]
        if 'Square' in engineer_type and len(sq) > 0:
            data= pd.merge(data,sq,left_index=True,right_index=True)
    return data
When I call it with lists of features, engineer_type and the dataset:
engineer_type = ['Square','Add','Multiply','Divide']
df = engineer_features(engineer_type,features,joined)
I get the error: Failed at object (analyzing bytecode)
'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'
Same question here. I think the problem might be the lambda function, since Numba does not support function creation.
I had this same error. Numba doesn't support pandas. I converted the important columns from my pandas df into a bunch of arrays and it worked successfully under @jit.
Arrays are also much faster than a pandas df, in case you need that for processing large data.
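As a rough sketch of the "convert to arrays first" idea: the pairwise_features helper and the sample columns below are hypothetical, and Numba is deliberately left out so the example runs with only NumPy and pandas. The arithmetic on the arrays is the numeric part one would hand to a compiled function; the column-name handling stays in ordinary Python.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_features(names, arrays):
    """Build add/multiply/divide features from plain NumPy arrays."""
    features = {}
    for (n1, a1), (n2, a2) in combinations(list(zip(names, arrays)), 2):
        features['{0}+{1}'.format(n1, n2)] = a1 + a2
        features['{0}*{1}'.format(n1, n2)] = a1 * a2
        features['{0}/{1}'.format(n1, n2)] = a1 / a2
    return features


# Made-up data standing in for the asker's dataframe.
data = pd.DataFrame({'a': [1.0, 2.0], 'b': [4.0, 8.0]})
names = ['a', 'b']

# Pull each column out of pandas as a NumPy array before doing the math.
arrays = [data[name].values for name in names]

engineered = pairwise_features(names, arrays)
print(sorted(engineered))
```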

Python 2.7 - How to call individual columns from transposed csv file

I understand that the csv module exists, however for my current project we are not allowed to use the module to call csv files.
My code is as follows;
table = []
for line in open("data.csv"):
    data = line.split(",")
    table.append(data)
transposed = [[table[j][i] for j in range(len(table))] for i in range(len(table[0]))]
rows = transposed[1][1:]
rows = [float(i) for i in rows]
I'm really new to Python, so this is probably a very basic question, but I've been scouring the internet all day and am struggling to find a solution. All I need is to be able to pull the data out of any individual column so I can analyse it. Thanks
Your data is organized as a list of lists, where each sublist represents a row. To illustrate this more clearly, I would avoid list comprehensions here because they are harder to read. I would also avoid variable names like 'i' and 'j' and use more descriptive names such as row or column instead. Here is a simple example of how I would accomplish this:
def read_csv():
    table = []
    with open("data.csv") as fileobj:
        for line in fileobj.readlines():
            data = line.strip().split(',')
            table.append(data)
    return table

def get_column_data(data, column_index):
    column_data = []
    for row in data:
        cell_data = row[column_index]
        column_data.append(cell_data)
    return column_data

data = read_csv()
get_column_data(data, column_index=2) #example usage
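The question also skips a header row and converts the values to floats. Here is a minimal sketch of that, using an in-memory file with made-up contents so it runs without data.csv; the function names and the skip_header flag are my own, and it assumes the first row is a header, as the [1:] slice in the question suggests.

```python
import io


def read_table(fileobj):
    """Split each non-empty line on commas; returns a list of rows."""
    table = []
    for line in fileobj:
        line = line.strip()
        if line:
            table.append(line.split(','))
    return table


def get_numeric_column(table, column_index, skip_header=True):
    """Return one column as floats, optionally skipping the header row."""
    rows = table[1:] if skip_header else table
    return [float(row[column_index]) for row in rows]


# Made-up file contents standing in for data.csv.
sample = io.StringIO(u"time,speed\n0,1.5\n1,2.5\n")

table = read_table(sample)
speeds = get_numeric_column(table, column_index=1)
print(speeds)
```

Note that a plain split(',') does not handle quoted fields that contain commas, which is one of the things the csv module would normally take care of.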

Avoiding pandas chained selection

I'm trying to determine "best practice" to do the following without incurring a SettingWithCopyWarning. I'm using Python 2.7 and pandas 0.15.2.
What I want to do is subselect a dataframe and then use this selection as a new dataframe, without risking modification to the original. Here's an example of what I'm doing:
import pandas as pd

def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df[df['color'] == 'blue']

cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
The above generates a SettingWithCopyWarning in current pandas but otherwise behaves as I want it to (i.e. the cars df has not been modified).
What is the best way to implement select_blue_cars so that the subsequent code doesn't trigger this warning?
Should I be using .copy() everywhere?
return df[df['color'] == 'blue'].copy()
(Aside) What's the performance of copy() like?
Eventually I'd like to chain simple transform functions like select_blue_cars:
blue_fords = select_fords(select_blue_cars(cars))
Edit: Having thought about this a bit more I think that I'm looking for a single transform which selects a copy from the dataframe without explicitly calling .copy(). That way I can write functions to do little transformations on the df and chain them.
Transposition, for example df.T, gives a new dataframe, so there's no need to call .copy().
df2 = df.T
df2 = df.T.copy() # no need
It looks like, in the case of selection, .copy() is required for this pattern.
How you get around the SettingWithCopyWarning depends a bit on how long you plan on keeping the subset around. If you just want to briefly look at the price within a particular colour and then return to the overall dataframe, the suggestions JohnE has given are pretty good. If you actually want to keep the subset around and perform a bunch of separate analyses on it, then what I usually do is subset with .loc and explicitly copy, e.g.:
subset = df.loc[df['condition'] > 5, :].copy()
In your code, this would be:
import pandas as pd

def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df.loc[df['color'] == 'blue', :].copy()

cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
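For the chaining the question asks about, here is a small sketch. The body of select_fords is my assumption; the question only names the function. Each selector returns its own copy, so the chained result can be modified without touching cars.

```python
import pandas as pd


def select_blue_cars(df):
    """Return an independent copy of the blue cars."""
    return df.loc[df['color'] == 'blue', :].copy()


def select_fords(df):
    """Return an independent copy of the Fords (assumed implementation)."""
    return df.loc[df['make'] == 'Ford', :].copy()


cars = pd.DataFrame({'color': ['blue', 'blue', 'red'],
                     'make': ['Ford', 'BMW', 'Ford']})

# Chain the two selectors; each one returns a fresh dataframe.
blue_fords = select_fords(select_blue_cars(cars))

# Adding a column affects only the copy, not the original.
blue_fords['price'] = 10000

print(blue_fords)
print(list(cars.columns))
```

The cost is one extra copy per step, which is usually negligible for small selections but worth keeping in mind for very large frames.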
I think this remains one of the more confusing parts of pandas. You are actually asking 2 or 3 questions and the answers may be less simple than you'd think. Consequently, I'll make the simplifying assumption that you'll just keep everything in one dataset (if not, it's not that big a deal though), and give a simple answer.
What you want to do (in pseudocode):
price = 10000 if color == blue
The simplest way to do this is actually with numpy where():
cars['price'] = np.where( cars['color'] == 'blue', 10000, np.nan )
  color  make  price
0  blue  Ford  10000
1  blue   BMW  10000
2   red  Ford    NaN
You can also nest where(), which makes it a powerful yet simple method for conditional setting like this. You can also use loc (or ix in older pandas; ix was removed in pandas 1.0). With loc, the 'price' column is created if it does not exist yet:
cars.loc[ cars.color == 'blue', 'price' ] = 10000
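A quick sketch of what nesting where() looks like. The extra green car and the 8000 price for red cars are made up for illustration.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({'color': ['blue', 'blue', 'red', 'green'],
                     'make': ['Ford', 'BMW', 'Ford', 'BMW']})

# The inner where() supplies the "else" branch of the outer one:
# blue -> 10000, red -> 8000, anything else -> NaN.
cars['price'] = np.where(cars['color'] == 'blue', 10000,
                         np.where(cars['color'] == 'red', 8000, np.nan))

print(cars)
```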
And to briefly address the chained indexing warning, what it's mostly saying is don't try to do too much on the left-hand side when setting values. This is a problem:
df[ df.y > 5 ]['x'] = df['z']
This is OK though:
df['x'] = df[ df.y > 5 ]['z']
The result of chained indexing may be a copy rather than a reference, so in the former the assignment can land on a temporary copy and be lost, while the latter only reads from the chained result. You can also get around this by using loc/iloc (or ix in older pandas).
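For completeness, a minimal sketch of the single-step loc form of the first statement, with made-up values. The row mask and the column name go into one indexing call, so pandas writes directly into df.

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 0, 0],
                   'y': [1, 6, 9],
                   'z': [10, 20, 30]})

# One indexing operation on the left-hand side: rows where y > 5, column x.
# The right-hand side is aligned on the index, so only those rows are set.
df.loc[df['y'] > 5, 'x'] = df['z']

print(df)
```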