Datetime in python - speed of calculations - big data - python-2.7

I want to find the difference (in days) between two columns in a dataframe (more specifically in the graphlab SFrame datastructure).
I have tried to write a couple of functions to do this but I cannot seem to create a function that is fast enough. Speed is my issue right now as I have ~80 million rows to process.
I have tried two different functions but both are too slow:
The t2_colname_str and t1_colname_str arguments are the names of the columns I want to use; both columns contain datetime.datetime objects.
For Loop
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime as datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = []
    for i in range(len(sframe_obj[t2_colname_str])):
        t2 = sframe_obj[t2_colname_str][i]
        t1 = sframe_obj[t1_colname_str][i]
        try:
            diff = t2 - t1
            diff_days = diff.days
            diff_days_list.append(diff_days)
        except TypeError:
            diff_days_list.append(None)
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
List Comprehension
I know this is not the intended purpose of list comprehensions, but I just tried it to see if it was faster.
def diff_days(sframe_obj, t2_colname_str, t1_colname_str):
    import graphlab as gl
    import datetime as datetime
    # creating the new column name to be used later
    new_colname = str(t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9])
    diff_days_list = [(sframe_obj[t2_colname_str][i] - sframe_obj[t1_colname_str][i]).days
                      if sframe_obj[t2_colname_str][i] is not None and sframe_obj[t1_colname_str][i] is not None
                      else None
                      for i in range(len(sframe_obj[t2_colname_str]))]
    sframe_obj[new_colname] = gl.SArray(diff_days_list)
Additional Notes
I have been using GraphLab-Create by Dato and their SFrame data-structure mainly because it parallelizes all the computation which makes my analysis super-fast and it has a great library for machine learning applications. It's a great product if you haven't checked it out already.
GraphLab User Guide can be found here: https://dato.com/learn/userguide/index.html

I'm glad you found a workable way; however, SArrays allow vector operations, so you don't need to loop through every element of the column. SArrays will iterate, but they're REALLY slow at that.
Unfortunately, SArrays don't support vector operations on datetime types because they don't support a "timedelta" type. You can do this though:
diff = sframe_obj[t2_colname].astype(int) - sframe_obj[t1_colname].astype(int)
That will convert the columns to a UNIX timestamp and then do a vectorized difference operation, which should be plenty fast...at least faster than a conversion to NumPy.
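For reference, a minimal sketch of how the whole helper might look with that vectorized difference (this assumes, as described above, that astype(int) yields UNIX timestamps in seconds and that SArrays support scalar arithmetic; the column-name slicing just mirrors the question's convention):
def diff_days_vectorized(sframe_obj, t2_colname_str, t1_colname_str):
    # same naming convention as the question's helpers
    new_colname = t2_colname_str[:-9] + "_DiffDays_" + t1_colname_str[:-9]
    # astype(int) -> seconds since the UNIX epoch
    diff_seconds = (sframe_obj[t2_colname_str].astype(int)
                    - sframe_obj[t1_colname_str].astype(int))
    # 86400 seconds per day; cast back to int to truncate to whole days
    sframe_obj[new_colname] = (diff_seconds / 86400).astype(int)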


Fitting multiple data sets using lmfit without writing an objective function

This topic describes how to fit multiple data-sets using lmfit:
Python and lmfit: How to fit multiple datasets with shared parameters?
However, it uses a fitting/objective function written by the user.
I was wondering if it's possible to fit multiple data-sets using lmfit without writing an objective function, using the model.fit() method of the Model class instead.
As an example: let's say we have multiple data sets of (x, y) coordinates that we want to fit using the same model function in order to find the set of parameters that on average fit all the data best.
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel

def gauss(x, amp, cen, sigma):
    return amp * np.exp(-(x - cen)**2 / (2. * sigma**2))

x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.09)
y1 = gauss(x1, 1., 50., 5.) + np.random.normal(size=len(x1), scale=0.1)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.1)

mod = GaussianModel()
params = mod.make_params()
mod.fit([y1, y2], params, x=[x1, x2])
I guess if this is possible the data has to be passed to mod.fit in the right type. The documentation only says that mod.fit takes an array-like data input.
I tried to give it lists and arrays. If I pass the different data sets as a list, I get a ValueError: setting an array element with a sequence.
If I pass an array, I get an AttributeError: 'numpy.ndarray' has no attribute 'exp'.
So am I just trying to do something that isn't possible or am I doing something wrong?
Well, I think the answer is "sort of". The lmfit.Model class is meant to represent a model for an array of data. So, if you can map your multiple datasets into a numpy ndarray (say, with np.concatenate), you can probably write a Model function to represent this by building sub-models for the different datasets and concatenating them in the same way.
I don't think you could do that with any of the built-in models. I also think that once you start down the road of writing complex model functions, it isn't a very big jump to writing objective functions. That is, what would be
def model_function(x, a, b, c):
    ### do some calculation with x, a, b, c values
    result = a + x*b + x*x*c
    return result
might become
def objective_function(params, x, data):
    vals = params.valuesdict()
    return data - model_function(x, vals['a'], vals['b'], vals['c'])
If the model function is doing anything complex, the additional burden of unpacking the parameters and subtracting the data is pretty small. And, especially if some parameters are used for multiple datasets and some only for particular datasets, you'll have to manage that in either the model function or the objective function. In the example you link to, my answer included a loop over datasets, picking out parameters by name for each dataset; see the sketch below. You'll probably want to do something like that. You could probably do that in a model function by thinking of it as modeling the concatenated datasets, but I'm not sure you'd really gain a lot by doing that.
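For concreteness, a minimal sketch of such a per-dataset objective function, assuming two Gaussian datasets that share sigma while keeping their own amplitude and center (the amp_0/cen_0 naming is my own convention, not lmfit's):
import numpy as np
from lmfit import Parameters, minimize

def gauss(x, amp, cen, sigma):
    return amp * np.exp(-(x - cen)**2 / (2. * sigma**2))

def objective(params, x_sets, y_sets):
    """Residuals for all datasets, concatenated into one flat array."""
    v = params.valuesdict()
    residuals = []
    for i, (x, y) in enumerate(zip(x_sets, y_sets)):
        # per-dataset amplitude/center, one sigma shared by every dataset
        residuals.append(y - gauss(x, v['amp_%d' % i], v['cen_%d' % i], v['sigma']))
    # concatenating also copes with datasets of different lengths
    return np.concatenate(residuals)

x_sets = [np.arange(0., 100., 0.1), np.arange(0., 100., 0.09)]
y_sets = [gauss(x_sets[0], 1.0, 50.0, 5.0) + np.random.normal(size=len(x_sets[0]), scale=0.1),
          gauss(x_sets[1], 0.8, 48.4, 5.0) + np.random.normal(size=len(x_sets[1]), scale=0.1)]

params = Parameters()
params.add('sigma', value=3.0, min=0.1)
for i in range(len(x_sets)):
    params.add('amp_%d' % i, value=1.0, min=0.0)
    params.add('cen_%d' % i, value=50.0)

result = minimize(objective, params, args=(x_sets, y_sets))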
I found the problem. Actually model.fit() will handle arrays of multiple data sets just fine and perform a proper fit. The correct call of model.fit() with multiple data sets would be:
import numpy as np
from lmfit import Model, Parameters
from lmfit.models import GaussianModel
import matplotlib.pyplot as plt

def gauss(x, amp, cen, sigma):
    "basic gaussian"
    return amp * np.exp(-(x - cen)**2 / (2. * sigma**2))

x1 = np.arange(0., 100., 0.1)
x2 = np.arange(0., 100., 0.1)
y1 = gauss(x1, 1., 50., 5.) + np.random.normal(size=len(x1), scale=0.01)
y2 = gauss(x2, 0.8, 48.4, 4.5) + np.random.normal(size=len(x2), scale=0.01)

mod = GaussianModel()
params = mod.make_params()
params['amplitude'].set(1., min=0.01, max=100.)
params['center'].set(1., min=0.01, max=100.)
params['sigma'].set(1., min=0.01, max=100.)

result = mod.fit(np.array([y1, y2]), params, method='basinhopping',
                 x=np.array([x1, x2]))
print(result.fit_report(min_correl=0.5))

fig, ax = plt.subplots()
plt.plot(x1, y1, lw=2, color='red')
plt.plot(x2, y2, lw=2, color='orange')
plt.plot(x1, result.eval(x=x1), lw=2, color='black')
plt.show()
The problem in the original code actually lies in the fact that my data sets don't have the same length. However, I'm not sure at all how to handle that case in the most elegant way.

How to maintain order of insertion in dictionary in python? [duplicate]

I have a dictionary that I declared in a particular order and want to keep it in that order all the time. The keys/values can't really be kept in order based on their values; I just want it in the order that I declared it.
So if I have the dictionary:
d = {'ac': 33, 'gw': 20, 'ap': 102, 'za': 321, 'bs': 10}
It isn't in that order if I view it or iterate through it. Is there any way to make sure Python will keep the explicit order that I declared the keys/values in?
From Python 3.6 onwards, the standard dict type maintains insertion order by default.
Defining
d = {'ac':33, 'gw':20, 'ap':102, 'za':321, 'bs':10}
will result in a dictionary with the keys in the order listed in the source code.
This was achieved by using a simple array with integers for the sparse hash table, where those integers index into another array that stores the key-value pairs (plus the calculated hash). That latter array just happens to store the items in insertion order, and the whole combination actually uses less memory than the implementation used in Python 3.5 and before. See the original idea post by Raymond Hettinger for details.
In 3.6 this was still considered an implementation detail; see the What's New in Python 3.6 documentation:
The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon (this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5).
Python 3.7 elevates this implementation detail to a language specification, so it is now mandatory that dict preserves order in all Python implementations compatible with that version or newer. See the pronouncement by the BDFL. As of Python 3.8, dictionaries also support iteration in reverse.
You may still want to use the collections.OrderedDict() class in certain cases, as it offers some additional functionality on top of the standard dict type, such as being reversible (this extends to the view objects) and supporting reordering (via the move_to_end() method).
from collections import OrderedDict
OrderedDict((word, True) for word in words)
contains
OrderedDict([('He', True), ('will', True), ('be', True), ('the', True), ('winner', True)])
If the values are True (or any other immutable object), you can also use:
OrderedDict.fromkeys(words, True)
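A quick illustration of that extra functionality (this requires Python 3, since move_to_end() is not available on Python 2's OrderedDict; the keys are just placeholders):
from collections import OrderedDict

d = OrderedDict.fromkeys(['a', 'b', 'c'])
d.move_to_end('a')          # push 'a' to the back
print(list(d))              # ['b', 'c', 'a']
print(list(reversed(d)))    # ['a', 'c', 'b']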
Rather than explaining the theoretical part, I'll give a simple example.
>>> from collections import OrderedDict
>>> my_dictionary=OrderedDict()
>>> my_dictionary['foo']=3
>>> my_dictionary['aol']=1
>>> my_dictionary
OrderedDict([('foo', 3), ('aol', 1)])
>>> dict(my_dictionary)
{'foo': 3, 'aol': 1}
Note that this answer applies to Python versions prior to Python 3.7. CPython 3.6 maintains insertion order under most circumstances as an implementation detail. From Python 3.7 onward, it has been declared that implementations MUST maintain insertion order to be compliant.
python dictionaries are unordered. If you want an ordered dictionary, try collections.OrderedDict.
Note that OrderedDict was introduced into the standard library in python 2.7. If you have an older version of python, you can find recipes for ordered dictionaries on ActiveState.
Dictionaries will use an order that makes searching efficient, and you can't change that.
You could just use a list of objects (a 2-element tuple in a simple case, or even a class) and append items to the end. You can then use linear search to find items in it.
Alternatively, you could create or use a different data structure created with the intention of maintaining order.
I came across this post while trying to figure out how to get OrderedDict to work. PyDev for Eclipse couldn't find OrderedDict at all, so I ended up deciding to make a tuple of my dictionary's key values as I would like them to be ordered. When I needed to output my list, I just iterated through the tuple's values and plugged the iterated 'key' from the tuple into the dictionary to retrieve my values in the order I needed them.
example:
test_dict = dict( val1 = "hi", val2 = "bye", val3 = "huh?", val4 = "what....")
test_tuple = ( 'val1', 'val2', 'val3', 'val4')
for key in test_tuple: print(test_dict[key])
It's a tad cumbersome, but I'm pressed for time and it's the workaround I came up with.
note: the list of lists approach that somebody else suggested does not really make sense to me, because lists are ordered and indexed (and are also a different structure than dictionaries).
You can't really do what you want with a dictionary. You already have the dictionary d = {'ac':33, 'gw':20, 'ap':102, 'za':321, 'bs':10} created. I found there was no way to keep it in order once it is already created. What I did was make a json file instead with the object:
{"ac":33,"gw":20,"ap":102,"za":321,"bs":10}
I used:
r = json.load(open('file.json'), object_pairs_hook=OrderedDict)
then used:
print json.dumps(r)
to verify.
from collections import OrderedDict
list1 = ['k1', 'k2']
list2 = ['v1', 'v2']
new_ordered_dict = OrderedDict(zip(list1, list2))
print new_ordered_dict
# OrderedDict([('k1', 'v1'), ('k2', 'v2')])
Another alternative is to use a Pandas DataFrame, as it guarantees both the order and the index locations of the items in a dict-like structure.
I had a similar problem when developing a Django project. I couldn't use OrderedDict, because I was running an old version of python, so the solution was to use Django's SortedDict class:
https://code.djangoproject.com/wiki/SortedDict
e.g.,
from django.utils.datastructures import SortedDict
d2 = SortedDict()
d2['b'] = 1
d2['a'] = 2
d2['c'] = 3
Note: This answer is originally from 2011. If you have access to Python version 2.7 or higher, then you should have access to the now standard collections.OrderedDict, of which many examples have been provided by others in this thread.
Generally, you can design a class that behaves like a dictionary, mainly by implementing the methods __contains__, __getitem__, __delitem__, __setitem__ and some more. That class can have any behaviour you like, for example providing a sorted iterator over the keys; a rough sketch follows below.
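As a hypothetical sketch of that idea (the class name and storage strategy are illustrative, not from any library):
class SortedKeyDict(object):
    """Dict-like wrapper that iterates its keys in sorted order."""
    def __init__(self):
        self._data = {}
    def __setitem__(self, key, value):
        self._data[key] = value
    def __getitem__(self, key):
        return self._data[key]
    def __delitem__(self, key):
        del self._data[key]
    def __contains__(self, key):
        return key in self._data
    def __iter__(self):
        # iterate keys in sorted order rather than in hash or insertion order
        return iter(sorted(self._data))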
If you would like to have a dictionary in a specific order, you can also create a list of lists, where the first item is the key and the second item is the value.
For example:
>>> list =[[1,2],[2,3]]
>>> for i in list:
... print i[0]
... print i[1]
1
2
2
3
You can do the same thing that I did for a dictionary.
Create a list and an empty dictionary:
dictionary_items = {}
fields = [['Name', 'Himanshu Kanojiya'], ['email id', 'hima#gmail.com']]
l = fields[0][0]
m = fields[0][1]
n = fields[1][0]
q = fields[1][1]
dictionary_items[l] = m
dictionary_items[n] = q
print dictionary_items

Difficulty Understanding TensorFlow Computations

I'm new to TensorFlow and have difficulty understanding how the computations work. I could not find the answer to my question on the web.
For the following piece of code, the last time I print "d" in the for loop of the "train_neural_net()" function, I'm expecting the values to be identical to when I print "test_distance.eval". But they are way different. Can anyone tell me why this is happening? Isn't TensorFlow supposed to cache the Variable results learned in the for loop and use them when I run "test_distance.eval"?
def neural_network_model1(data):
    nn1_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])),
                          'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn1_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                          'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn1_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])),
                        'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn1_l1 = tf.add(tf.matmul(data, nn1_hidden_1_layer["weights"]), nn1_hidden_1_layer["biasses"])
    nn1_l1 = tf.sigmoid(nn1_l1)
    nn1_l2 = tf.add(tf.matmul(nn1_l1, nn1_hidden_2_layer["weights"]), nn1_hidden_2_layer["biasses"])
    nn1_l2 = tf.sigmoid(nn1_l2)
    nn1_output = tf.add(tf.matmul(nn1_l2, nn1_output_layer["weights"]), nn1_output_layer["biasses"])
    return nn1_output

def neural_network_model2(data):
    nn2_hidden_1_layer = {'weights': tf.Variable(tf.random_normal([5, n_nodes_hl1])),
                          'biasses': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    nn2_hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                          'biasses': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    nn2_output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, vector_size])),
                        'biasses': tf.Variable(tf.random_normal([vector_size]))}
    nn2_l1 = tf.add(tf.matmul(data, nn2_hidden_1_layer["weights"]), nn2_hidden_1_layer["biasses"])
    nn2_l1 = tf.sigmoid(nn2_l1)
    nn2_l2 = tf.add(tf.matmul(nn2_l1, nn2_hidden_2_layer["weights"]), nn2_hidden_2_layer["biasses"])
    nn2_l2 = tf.sigmoid(nn2_l2)
    nn2_output = tf.add(tf.matmul(nn2_l2, nn2_output_layer["weights"]), nn2_output_layer["biasses"])
    return nn2_output

def train_neural_net():
    prediction1 = neural_network_model1(x1)
    prediction2 = neural_network_model2(x2)
    distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(prediction1, prediction2)), reduction_indices=1))
    cost = tf.reduce_mean(tf.multiply(y, distance))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 500
    test_result1 = neural_network_model1(x3)
    test_result2 = neural_network_model2(x4)
    test_distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(test_result1, test_result2)), reduction_indices=1))
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            _, d = sess.run([optimizer, distance], feed_dict={x1: train_x1, x2: train_x2, y: train_y})
            print("Epoch", epoch, "distance", d)
        print("test distance", test_distance.eval({x3: train_x1, x4: train_x2}))

train_neural_net()
Each time you call the functions neural_network_model1() or neural_network_model2(), you create a new set of variables, so there are four sets of variables in total.
The call to sess.run(tf.global_variables_initializer()) initializes all four sets of variables.
When you train in the for loop, you only update the first two sets of variables, created with these lines:
prediction1 = neural_network_model1(x1)
prediction2 = neural_network_model2(x2)
When you evaluate with test_distance.eval(), the tensor test_distance depends only on the last two sets of variables, which were created with these lines:
test_result1 = neural_network_model1(x3)
test_result2 = neural_network_model2(x4)
These variables were never updated in the training loop, so the evaluation results will be based on the random initial values.
TensorFlow does include some code for sharing weights between multiple calls to the same function, using with tf.variable_scope(...): blocks. For more information on how to use these, see the tutorial on variables and sharing on the TensorFlow website.
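As a rough sketch of that sharing approach (using the TF 1.x variable-scope API; the scope and variable names here are illustrative, not taken from the question's code):
import tensorflow as tf

def shared_model(data, n_hidden, n_out, scope="model1"):
    # reuse=tf.AUTO_REUSE makes a second call return the *same* variables
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        w1 = tf.get_variable("w1", [5, n_hidden])
        b1 = tf.get_variable("b1", [n_hidden])
        h1 = tf.sigmoid(tf.matmul(data, w1) + b1)
        w2 = tf.get_variable("w2", [n_hidden, n_out])
        b2 = tf.get_variable("b2", [n_out])
        return tf.matmul(h1, w2) + b2

# prediction1  = shared_model(x1, n_nodes_hl1, vector_size)
# test_result1 = shared_model(x3, n_nodes_hl1, vector_size)  # reuses the same weights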
You don't need to define two functions for generating models; you can use tf.name_scope and pass a model name to the function to use it as a prefix for the variable declarations. On the other hand, you defined two distance tensors: the first is distance and the second is test_distance. But your model will learn from the training data to minimize the cost, which is only related to the first distance tensor. Therefore, test_distance is never used, and the model related to it will never learn anything! Again, there is no need for two distance operations; you only need one. When you want to calculate the train distance, feed it with train data, and when you want to calculate the test distance, feed it with test data.
Anyway, if you want the second distance to work, you should declare another optimizer for it, and you also have to train it as you did for the first one. You should also consider the fact that models learn based on their initial values and training data. Even if you feed both models exactly the same training batches, you can't expect them to end up with exactly similar characteristics, since the initial weights are different and this could cause them to fall into different local minima of the error surface. Finally, notice that whenever you call neural_network_model1 or neural_network_model2 you generate new weights and biases, because tf.Variable creates new variables for you.

'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION' in Numba

I haven't seen this specific scenario in my research for this error in Numba. This is my first time using the package so it might be something obvious.
I have a function that calculates engineered features in a data set by adding, multiplying and/or dividing each column in a dataframe called data, and I wanted to test whether numba would speed it up:
import numpy as np
import pandas as pd
from itertools import combinations
from numba import jit

@jit
def engineer_features(engineer_type, features, joined):
    # choose which features to engineer (must be > 1)
    engineered = features
    if len(engineered) > 1:
        if 'Square' in engineer_type:
            sq = data[features].apply(np.square)
            sq.columns = map(lambda s: s + '_^2', features)
        for c1, c2 in combinations(engineered, 2):
            if 'Add' in engineer_type:
                data['{0}+{1}'.format(c1, c2)] = data[c1] + data[c2]
            if 'Multiply' in engineer_type:
                data['{0}*{1}'.format(c1, c2)] = data[c1] * data[c2]
            if 'Divide' in engineer_type:
                data['{0}/{1}'.format(c1, c2)] = data[c1] / data[c2]
        if 'Square' in engineer_type and len(sq) > 0:
            data = pd.merge(data, sq, left_index=True, right_index=True)
    return data
When I call it with lists of features, engineer_type and the dataset:
engineer_type = ['Square','Add','Multiply','Divide']
df = engineer_features(engineer_type,features,joined)
I get the error: Failed at object (analyzing bytecode)
'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'
Same question here. I think the problem might be the lambda function since numba does not support function creation.
I had this same error. Numba doesn't support pandas. I converted the important columns from my pandas DataFrame into a bunch of NumPy arrays and it worked successfully under @jit.
Also, arrays are much faster than pandas DataFrames, in case you need that for processing large data.
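A small, hypothetical sketch of that array-based approach (the column names and the njit decorator choice are mine, not from the answer above):
import numpy as np
import pandas as pd
from numba import njit

@njit
def add_multiply_divide(a, b):
    # plain numeric kernel over two 1-D arrays; numba compiles this in nopython mode
    n = a.shape[0]
    out = np.empty((n, 3))
    for i in range(n):
        out[i, 0] = a[i] + b[i]
        out[i, 1] = a[i] * b[i]
        out[i, 2] = a[i] / b[i]
    return out

data = pd.DataFrame({'c1': [1.0, 2.0, 3.0], 'c2': [4.0, 5.0, 6.0]})
engineered = add_multiply_divide(data['c1'].values, data['c2'].values)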

Avoiding pandas chained selection

I'm trying to determine "best practice" to do the following without incurring a SettingWithCopyWarning. I'm using python 2.7 and pandas 0.15.2.
What I want to do is subselect a dataframe and then use this selection as a new dataframe, without risking modification to the original. Here's an example of what I'm doing:
import pandas as pd

def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df[df['color'] == 'blue']

cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
The above generates a SettingWithCopyWarning in current pandas but otherwise behaves as I want it to (ie. the cars df has not been modified).
What is the best way to implement select_blue_cars so that the subsequent code doesn't trigger this warning?
Should I be using .copy() everywhere?
return df[df['color'] == 'blue'].copy()
(Aside) What's the performance of copy() like?
Eventually I'd like to chain simple transform functions like select_blue_cars:
blue_fords = select_fords(select_blue_cars(cars))
Edit: Having thought about this a bit more I think that I'm looking for a single transform which selects a copy from the dataframe without explicitly calling .copy(). That way I can write functions to do little transformations on the df and chain them.
Transposition, for example, gives a new dataframe (df.T), so there's no need to call .copy():
df2 = df.T
df2 = df.T.copy() # no need
It looks like, in the case of selection, .copy() is required for this pattern.
How you get around the SettingWithCopyWarning depends a bit on how long you plan on keeping the subset around. If you just want to briefly look at the price within a particular colour and then return to the overall dataframe, the suggestions JohnE has given are pretty good. If you actually want to keep the subset around and perform a bunch of separate analyses on it, then what I usually do is subset with .loc and explicitly copy, e.g.:
subset = df.loc[df['condition'] > 5, :].copy()
In your code, this would be:
import pandas as pd

def select_blue_cars(df):
    """Returns a new dataframe of blue cars"""
    return df.loc[df['color'] == 'blue', :].copy()

cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
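And, to sketch the chaining from the question, a hypothetical select_fords written in the same style composes cleanly (select_fords is my own example helper, not from the question):
def select_fords(df):
    """Returns a new dataframe of Ford cars"""
    return df.loc[df['make'] == 'Ford', :].copy()

blue_fords = select_fords(select_blue_cars(cars))
blue_fords['price'] = 12000   # no SettingWithCopyWarning; the original 'cars' is untouched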
I think this remains one of the more confusing parts of pandas. You are actually asking 2 or 3 questions and the answers may be less simple than you'd think. Consequently, I'll make the simplifying assumption that you'll just keep everything in one dataset (if not, it's not that big a deal though), and give a simple answer.
What you want to do (in pseudocode):
price = 10000 if color == blue
The simplest way to do this is actually with numpy where():
cars['price'] = np.where( cars['color'] == 'blue', 10000, np.nan )
  color  make  price
0  blue  Ford  10000
1  blue   BMW  10000
2   red  Ford    NaN
You can also nest where() so it's really very powerful and simple method for conditional setting like this. You can also use ix/loc/iloc (though you need to create an empty column for 'price' first):
cars.ix[ cars.color == 'blue', 'price' ] = 10000
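As a small aside, nesting np.where() for multiple conditions might look like this (the 8000 fallback price is purely illustrative):
cars['price'] = np.where(cars['color'] == 'blue', 10000,
                np.where(cars['make'] == 'Ford', 8000, np.nan))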
And to briefly address the chained indexing warning, what it's mostly saying is don't try to do too much on the left hand side when setting values:
df[ df.y > 5 ]['x'] = df['z']
this is OK though:
df['x'] = df[ df.y > 5 ]['z']
Because the result of chained indexing may be a copy rather than a reference, which will cause the former to fail but not the latter. You can also get around this by using ix/loc/iloc.
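For instance, a minimal sketch of that last suggestion, doing the conditional assignment in a single .loc call so there is no chained indexing at all (df, 'x', 'y' and 'z' are the placeholder names from the snippets above):
df.loc[df['y'] > 5, 'x'] = df['z']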