'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION' in Numba - python-2.7

I haven't seen this specific scenario in my research on this error in Numba. This is my first time using the package, so it might be something obvious.
I have a function that calculates engineered features in a data set by adding, multiplying and/or dividing each column in a dataframe called data, and I wanted to test whether Numba would speed it up:
@jit
def engineer_features(engineer_type, features, joined):
    # choose which features to engineer (must be > 1)
    engineered = features
    if len(engineered) > 1:
        if 'Square' in engineer_type:
            sq = data[features].apply(np.square)
            sq.columns = map(lambda s: s + '_^2', features)
        for c1, c2 in combinations(engineered, 2):
            if 'Add' in engineer_type:
                data['{0}+{1}'.format(c1, c2)] = data[c1] + data[c2]
            if 'Multiply' in engineer_type:
                data['{0}*{1}'.format(c1, c2)] = data[c1] * data[c2]
            if 'Divide' in engineer_type:
                data['{0}/{1}'.format(c1, c2)] = data[c1] / data[c2]
        if 'Square' in engineer_type and len(sq) > 0:
            data = pd.merge(data, sq, left_index=True, right_index=True)
    return data
When I call it with lists of features, engineer_type and the dataset:
engineer_type = ['Square','Add','Multiply','Divide']
df = engineer_features(engineer_type,features,joined)
I get the error: Failed at object (analyzing bytecode)
'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'

Same question here. I think the problem might be the lambda function, since Numba does not support function creation.

I had this same error. Numba doesn't support pandas. I converted the important columns of my pandas df into a bunch of NumPy arrays and it worked successfully under @jit.
Also, arrays are much faster than a pandas df, in case you need that for processing large data.
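For what it's worth, here is a minimal sketch of that array-based approach, assuming purely numeric columns; the column names a and b and the engineered outputs are placeholders of my own, not the ones from the question.
import numpy as np
import pandas as pd
from numba import jit
@jit(nopython=True)  # nopython mode works on plain arrays and scalars, not DataFrames
def engineer_pair(a, b):
    n = a.shape[0]
    added = np.empty(n)
    multiplied = np.empty(n)
    divided = np.empty(n)
    for i in range(n):
        added[i] = a[i] + b[i]
        multiplied[i] = a[i] * b[i]
        divided[i] = a[i] / b[i]
    return added, multiplied, divided
data = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
a = data['a'].values  # pull the columns out as NumPy arrays
b = data['b'].values
added, multiplied, divided = engineer_pair(a, b)
data['a+b'], data['a*b'], data['a/b'] = added, multiplied, divided
Only plain NumPy arrays cross the @jit boundary here, so the loop can compile in nopython mode; the results are assigned back to the dataframe afterwards.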

Fitting multiple data sets using lmfit without writing an objective function

This topic describes how to fit multiple data-sets using lmfit:
Python and lmfit: How to fit multiple datasets with shared parameters?
However, it uses a fitting/objective function written by the user.
I was wondering if it's possible to fit multiple data-sets using lmfit without writing an objective function, using the model.fit() method of the Model class.
As an example: let's say we have multiple data sets of (x,y) coordinates that we want to fit using the same model function, in order to find the set of parameters that on average fits all the data best.
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel
def gauss(x, amp, cen, sigma):
    return amp*np.exp(-(x-cen)**2/(2.*sigma**2))
x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.09)
y1 = gauss(x1, 1., 50., 5.) + np.random.normal(size=len(x1), scale=0.1)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.1)
mod = GaussianModel()
params = mod.make_params()
mod.fit([y1, y2], params, x=[x1, x2])
I guess that if this is possible, the data has to be passed to mod.fit in the right type. The documentation only says that mod.fit takes an array-like data input.
I tried to give it lists and arrays. If I pass the different data sets as a list I get a ValueError: setting an array element with a sequence
If I pass an array I get an AttributeError: 'numpy.ndarray' object has no attribute 'exp'
So am I just trying to do something that isn't possible or am I doing something wrong?
Well, I think the answer is "sort of". The lmfit.Model class is meant to represent a model for an array of data. So, if you can map your multiple datasets into a numpy ndarray (say, with np.concatenate), you can probably write a Model function to represent this by building sub-models for the different datasets and concatenating them in the same way.
I don't think you could do that with any of the built-in models. I also think that once you start down the road of writing complex model functions, it isn't a very big jump to writing objective functions. That is, what would be
def model_function(x, a, b, c):
    # do some calculation with x, a, b, c values
    result = a + x*b + x*x*c
    return result
might become
def objective_function(params, x, data):
    vals = params.valuesdict()
    return data - model_function(x, vals['a'], vals['b'], vals['c'])
If that model_function() is doing anything complex, the additional burden of unpacking the parameters and subtracting the data is pretty small. And, especially if some parameters would be used for multiple datasets and some only for particular datasets, you'll have to manage that in either the model function or the objective function. In the example you link to, my answer included a loop over datasets, picking out parameters by name for each dataset. You'll probably want to do something like that. You could probably do that in a model function by thinking of it as modeling the concatenated datasets, but I'm not sure you'd really gain a lot by doing that.
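To make that concrete, here is a minimal sketch of how the model_function/objective_function pair above might be driven with lmfit.minimize; the synthetic data and starting values are placeholders of my own.
import numpy as np
from lmfit import Parameters, minimize
def model_function(x, a, b, c):
    return a + x*b + x*x*c
def objective_function(params, x, data):
    vals = params.valuesdict()
    return data - model_function(x, vals['a'], vals['b'], vals['c'])
# hypothetical synthetic data for a single data set
x = np.linspace(0., 10., 101)
data = model_function(x, 1.0, 2.0, 0.5) + np.random.normal(size=x.size, scale=0.1)
params = Parameters()
params.add('a', value=0.0)
params.add('b', value=1.0)
params.add('c', value=0.0)
result = minimize(objective_function, params, args=(x, data))
print(result.params.valuesdict())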
I found the problem. Actually model.fit() will handle arrays of multiple data sets just fine and perform a proper fit. The correct call of model.fit() with multiple data sets would be:
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel
import matplotlib.pyplot as plt
def gauss(x, amp, cen, sigma):
    "basic gaussian"
    return amp*np.exp(-(x-cen)**2/(2.*sigma**2))
x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.1)
y1 = gauss(x1, 1., 50., 5.) + np.random.normal(size=len(x1), scale=0.01)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.01)
mod = GaussianModel()
params = mod.make_params()
params['amplitude'].set(1., min=0.01, max=100.)
params['center'].set(1., min=0.01, max=100.)
params['sigma'].set(1., min=0.01, max=100.)
result = mod.fit(np.array([y1, y2]), params, method='basinhopping',
                 x=np.array([x1, x2]))
print(result.fit_report(min_correl=0.5))
fig, ax = plt.subplots()
plt.plot(x1, y1, lw=2, color='red')
plt.plot(x2, y2, lw=2, color='orange')
plt.plot(x1, result.eval(x=x1), lw=2, color='black')
plt.show()
The problem in the original code actually lies in the fact that my data sets don't have the same length. However, I'm still not sure how to handle data sets of different lengths in the most elegant way.
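One way to handle data sets of different lengths, following the objective-function route from the earlier answer, is to concatenate the per-data-set residuals. This is only a sketch: the parameter names (amp_0, cen_0, and a shared sigma) are my own convention, not anything built into lmfit.
import numpy as np
from lmfit import Parameters, minimize
def gauss(x, amp, cen, sigma):
    return amp*np.exp(-(x-cen)**2/(2.*sigma**2))
# two data sets of different lengths
x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.09)
y1 = gauss(x1, 1.0, 50.0, 5.0) + np.random.normal(size=len(x1), scale=0.01)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.01)
def objective(params, x_sets, data_sets):
    # concatenated residuals; the arrays do not need to be the same length
    vals = params.valuesdict()
    residuals = []
    for i, (x, data) in enumerate(zip(x_sets, data_sets)):
        model = gauss(x, vals['amp_%d' % i], vals['cen_%d' % i], vals['sigma'])
        residuals.append(data - model)
    return np.concatenate(residuals)
params = Parameters()
params.add('sigma', value=3.0, min=0.1)  # width shared by both data sets
for i in range(2):
    params.add('amp_%d' % i, value=1.0, min=0.0)
    params.add('cen_%d' % i, value=45.0)
result = minimize(objective, params, args=([x1, x2], [y1, y2]))
print(result.params.valuesdict())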

Difficulty Understanding TensorFlow Computations

I'm new to TensorFlow and have difficulty understanding how the computations work. I could not find the answer to my question on the web.
For the following piece of code, the last time I print "d" in the for loop of the "train_neural_net()" function, I expect the values to be identical to what I get when I print "test_distance.eval". But they are very different. Can anyone tell me why this is happening? Isn't TensorFlow supposed to cache the Variable results learned in the for loop and use them when I run "test_distance.eval"?
def neural_network_model1(data):
    nn1_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn1_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn1_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])), 'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn1_l1 = tf.add(tf.matmul(data, nn1_hidden_1_layer["weights"]), nn1_hidden_1_layer["biasses"])
    nn1_l1 = tf.sigmoid(nn1_l1)
    nn1_l2 = tf.add(tf.matmul(nn1_l1, nn1_hidden_2_layer["weights"]), nn1_hidden_2_layer["biasses"])
    nn1_l2 = tf.sigmoid(nn1_l2)
    nn1_output = tf.add(tf.matmul(nn1_l2, nn1_output_layer["weights"]), nn1_output_layer["biasses"])
    return nn1_output
def neural_network_model2(data):
    nn2_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn2_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])), 'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn2_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])), 'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn2_l1 = tf.add(tf.matmul(data, nn2_hidden_1_layer["weights"]), nn2_hidden_1_layer["biasses"])
    nn2_l1 = tf.sigmoid(nn2_l1)
    nn2_l2 = tf.add(tf.matmul(nn2_l1, nn2_hidden_2_layer["weights"]), nn2_hidden_2_layer["biasses"])
    nn2_l2 = tf.sigmoid(nn2_l2)
    nn2_output = tf.add(tf.matmul(nn2_l2, nn2_output_layer["weights"]), nn2_output_layer["biasses"])
    return nn2_output
def train_neural_net():
    prediction1 = neural_network_model1(x1)
    prediction2 = neural_network_model2(x2)
    distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(prediction1, prediction2)), reduction_indices=1))
    cost = tf.reduce_mean(tf.multiply(y, distance))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 500
    test_result1 = neural_network_model1(x3)
    test_result2 = neural_network_model2(x4)
    test_distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(test_result1, test_result2)), reduction_indices=1))
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            _, d = sess.run([optimizer, distance], feed_dict={x1: train_x1, x2: train_x2, y: train_y})
            print("Epoch", epoch, "distance", d)
        print("test distance", test_distance.eval({x3: train_x1, x4: train_x2}))
train_neural_net()
Each time you call the functions neural_network_model1() or neural_network_model2(), you create a new set of variables, so there are four sets of variables in total.
The call to sess.run(tf.global_variables_initializer()) initializes all four sets of variables.
When you train in the for loop, you only update the first two sets of variables, created with these lines:
prediction1 = neural_network_model1(x1)
prediction2 = neural_network_model2(x2)
When you evaluate with test_distance.eval(), the tensor test_distance depends only on the last two sets of variables, which were created with these lines:
test_result1 = neural_network_model1(x3)
test_result2 = neural_network_model2(x4)
These variables were never updated in the training loop, so the evaluation results will be based on the random initial values.
TensorFlow does include some code for sharing weights between multiple calls to the same function, using with tf.variable_scope(...): blocks. For more information on how to use these, see the tutorial on variables and sharing on the TensorFlow website.
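As a rough sketch of what that sharing looks like with the TF1-style API used in the question (the scope name, shapes and initializers here are only illustrative):
import tensorflow as tf
def shared_model(data, reuse=False):
    with tf.variable_scope("model1", reuse=reuse):
        w = tf.get_variable("weights", shape=[5, 10],
                            initializer=tf.random_normal_initializer())
        b = tf.get_variable("biases", shape=[10],
                            initializer=tf.random_normal_initializer())
        return tf.sigmoid(tf.matmul(data, w) + b)
x1 = tf.placeholder(tf.float32, [None, 5])
x3 = tf.placeholder(tf.float32, [None, 5])
train_out = shared_model(x1)             # creates model1/weights, model1/biases
test_out = shared_model(x3, reuse=True)  # reuses those same variables
With shared variables, the test graph evaluates the same weights the training loop updates, instead of a second, untrained copy.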
You don't need to define two functions for generating models; you can use tf.name_scope and pass a model name to the function to use it as a prefix for the variable declarations. On the other hand, you defined two variables for distance: the first is distance and the second is test_distance. But your model will learn from the training data to minimize cost, which is related only to the first distance variable. Therefore, test_distance is never used, and the model related to it will never learn anything! Again, there is no need for two distance functions. You only need one. When you want to calculate the train distance, you should feed it with train data, and when you want to calculate the test distance, you should feed it with test data.
Anyway, if you want the second distance to work, you should declare another optimizer for it, and you also have to train it as you did the first one. You should also consider the fact that models learn based on their initial values and training data. Even if you feed both models with exactly the same training batches, you can't expect models with exactly the same characteristics, since the initial values of the weights are different and this could cause them to fall into different local minima of the error surface. Finally, note that whenever you call neural_network_model1 or neural_network_model2 you generate new weights and biases, because tf.Variable creates new variables for you.

Datetime in python - speed of calculations - big data

I want to find the difference (in days) between two columns in a dataframe (more specifically, in the graphlab SFrame data structure).
I have tried to write a couple of functions to do this but I cannot seem to create a function that is fast enough. Speed is my issue right now as I have ~80 million rows to process.
I have tried two different functions but both are too slow:
The t2_colname_str and t1_colname_str arguments are the names of the columns I want to use, and both columns contain datetime.datetime objects.
For Loop
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime as datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = []
    for i in range(len(sframe_obj[t2_colname_str])):
        t2 = sframe_obj[t2_colname_str][i]
        t1 = sframe_obj[t1_colname_str][i]
        try:
            diff = t2 - t1
            diff_days = diff.days
            diff_days_list.append(diff_days)
        except TypeError:
            diff_days_list.append(None)
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
List Comprehension
I know this is not the intended purpose of list comprehensions, but I just tried it to see if it was faster.
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime as datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = [(sframe_obj[t2_colname_str][i] - sframe_obj[t1_colname_str][i]).days if sframe_obj[t2_colname_str][i] and sframe_obj[t1_colname_str][i] != None else None for i in range(len(sframe_obj[t2_colname_str]))]
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
Additional Notes
I have been using GraphLab-Create by Dato and their SFrame data-structure mainly because it parallelizes all the computation which makes my analysis super-fast and it has a great library for machine learning applications. It's a great product if you haven't checked it out already.
GraphLab User Guide can be found here: https://dato.com/learn/userguide/index.html
I'm glad you found an approach that works for you; however, SArrays allow vector operations, so you don't need to loop through every element of the column. SArrays will iterate, but they're really slow at that.
Unfortunately, SArrays don't support vector operations on datetime types because they don't support a "timedelta" type. You can do this though:
diff = sframe_obj[t2_colname].astype(int) - sframe_obj[t1_colname].astype(int)
That will convert the columns to a UNIX timestamp and then do a vectorized difference operation, which should be plenty fast...at least faster than a conversion to NumPy.
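Assuming astype(int) does give seconds since the epoch, as described above, a vectorized rewrite of diff_days might look roughly like this; the divide-by-86400 step and the final int cast are my additions, and missing values should propagate as None rather than raising a TypeError, though that is worth checking.
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    # creating the new column name, as in the original functions
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    # vectorized difference in seconds, then converted to whole days
    seconds = sframe_obj[t2_colname_str].astype(int) - sframe_obj[t1_colname_str].astype(int)
    sframe_obj[new_colname] = (seconds / 86400).astype(int)
    return sframe_obj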

Avoiding pandas chained selection

I'm trying to determine "best practice" to do the following without incurring a SettingWithCopyWarning. I'm using Python 2.7 and pandas 0.15.2.
What I want to do is subselect a dataframe and then use this selection as a new dataframe, without risking modification to the original. Here's an example of what I'm doing:
import pandas as pd
def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df[df['color'] == 'blue']
cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
The above generates a SettingWithCopyWarning in current pandas but otherwise behaves as I want it to (i.e. the cars df has not been modified).
What is the best way to implement select_blue_cars so that the subsequent code doesn't trigger this warning?
Should I be using .copy() everywhere?
return df[df['color'] == 'blue'].copy()
(Aside) What's the performance of copy() like?
Eventually I'd like to chain simple transform functions like select_blue_cars:
blue_fords = select_fords(select_blue_cars(cars))
Edit: Having thought about this a bit more I think that I'm looking for a single transform which selects a copy from the dataframe without explicitly calling .copy(). That way I can write functions to do little transformations on the df and chain them.
Transposition, for example df.T, gives a new dataframe; there's no need to call .copy().
df2 = df.T
df2 = df.T.copy() # no need
It looks like, in the case of selection, .copy() is required for this pattern.
How you get around the SettingWithCopyWarning depends a bit on how long you plan on keeping the subset around. If you just want to briefly look at the price within a particular colour and then return to the overall dataframe, the suggestions JohnE has given are pretty good. If you actually want to keep the subset around and perform a bunch of separate analyses on it, then what I usually do is subset with .loc and explicitly copy, e.g.:
subset = df.loc[df['condition'] > 5, :].copy()
In your code, this would be:
import pandas as pd
def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df.loc[df['color'] == 'blue', :].copy()
cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
I think this remains one of the more confusing parts of pandas. You are actually asking 2 or 3 questions and the answers may be less simple than you'd think. Consequently, I'll make the simplifying assumption that you'll just keep everything in one dataset (if not, it's not that big a deal though), and give a simple answer.
What you want to do (in pseudocode):
price = 10000 if color == blue
The simplest way to do this is actually with numpy where():
cars['price'] = np.where( cars['color'] == 'blue', 10000, np.nan )
  color  make  price
0  blue  Ford  10000
1  blue   BMW  10000
2   red  Ford    NaN
You can also nest where(), so it's a really powerful and simple method for conditional setting like this. You can also use ix/loc/iloc (though you need to create an empty column for 'price' first):
cars.ix[ cars.color == 'blue', 'price' ] = 10000
And to briefly address the chained indexing warning, what it's mostly saying is don't try to do too much on the left hand side when setting values:
df[ df.y > 5 ]['x'] = df['z']
this is OK though:
df['x'] = df[ df.y > 5 ]['z']
This is because the result of chained indexing may be a copy rather than a reference, which will cause the former to fail but not the latter. You can also get around this by using ix/loc/iloc.
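For reference, a short sketch of the .loc form of the same conditional assignment, which avoids chained indexing entirely:
import numpy as np
import pandas as pd
cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
cars['price'] = np.nan  # create the column first
cars.loc[cars['color'] == 'blue', 'price'] = 10000  # single .loc call, no chained indexing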

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing states. I came across the pyzipcode module, which takes a zip code as input and returns the state, as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate. If the lookup data is in the form of a dict, then you should be able to do the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise, passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x].state, axis=1)
By using double square brackets we return a df, allowing you to pass the axis param.
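Since the original goal was to fill in only the missing states, here is a hedged sketch combining the lookup with a boolean mask; df stands for the questioner's dataframe, and the exception handling for unknown zip codes is an assumption about pyzipcode rather than documented behaviour.
from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
def state_from_zip(zip_code):
    # return None if the zip code is not in the database (assumed exception types)
    try:
        return zcdb[zip_code].state
    except (KeyError, IndexError):
        return None
# fill in the state only where it is missing, using the question's column names
missing = df['Physician_Profile_State'].isnull()
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].apply(state_from_zip))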