Python script | long running | Need suggestions to optimize - python-2.7

I have written this script to generate a dataset of 15-minute time intervals, based on the operational hours provided for each day of the week, for 365 days.
Example: say Store 1 opens at 9 AM and closes at 9 PM every day. That is 12 hours a day, 12 * 4 = 48 fifteen-minute periods a day, and 48 * 365 = 17520 fifteen-minute periods a year.
The sample dataset only contains 5 sites, but there are about 9000 sites that this script needs to generate data for.
The script runs fine for a handful of sites (100) and a couple of days (2), but it needs to run for 9000 sites and 365 days.
Looking for suggestions to make this run faster. This will be running on a local machine.
input data: https://drive.google.com/open?id=1uLYRUsJ2vM-TIGPvt5RhHDhTq3vr4V2y
output data: https://drive.google.com/open?id=13MZCQXfVDLBLFbbmmVagIJtm6LFDOk_T
Please let me know if I can help with anything more to get this answered.
import pandas as pd
import numpy as np
import cProfile
from datetime import timedelta, date, datetime


def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta


# inputs
empty_data = pd.DataFrame(columns=['store', 'timestamp'])
start_dt = date(2019, 1, 1)
days = 365
data = "input data | attached to the post"  # placeholder for the attached input data

for i in range(days):
    for j in range(len(data.store)):
        curr_date = start_dt + timedelta(days=i)
        curr_date_year = curr_date.year
        curr_date_month = curr_date.month
        curr_date_day = curr_date.day
        weekno = curr_date.weekday()
        if weekno < 5:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year, curr_date_month, curr_date_day, data['m_f_open_hrs'].iloc[j], data['m_f_open_min'].iloc[j]),
                                  datetime(curr_date_year, curr_date_month, curr_date_day, data['m_f_close_hrs'].iloc[j], data['m_f_close_min'].iloc[j]),
                                  timedelta(minutes=15))]
            vert = pd.DataFrame(dts, columns=['timestamp'])
            vert['store'] = data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])
        elif weekno == 5:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year, curr_date_month, curr_date_day, data['sat_open_hrs'].iloc[j], data['sat_open_min'].iloc[j]),
                                  datetime(curr_date_year, curr_date_month, curr_date_day, data['sat_close_hrs'].iloc[j], data['sat_close_min'].iloc[j]),
                                  timedelta(minutes=15))]
            vert = pd.DataFrame(dts, columns=['timestamp'])
            vert['store'] = data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])
        else:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year, curr_date_month, curr_date_day, data['sun_open_hrs'].iloc[j], data['sun_open_min'].iloc[j]),
                                  datetime(curr_date_year, curr_date_month, curr_date_day, data['sun_close_hrs'].iloc[j], data['sun_close_min'].iloc[j]),
                                  timedelta(minutes=15))]
            vert = pd.DataFrame(dts, columns=['timestamp'])
            vert['store'] = data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])

final_data = empty_data

I think the most time-consuming tasks in your script are the datetime calculations.
You should try to do all of those calculations using UNIX time. It represents time as an integer counting seconds, so you can compare two UNIX times with a simple subtraction.
In my opinion you should perform all the operations that way, and only once the process has finished convert everything back to a more readable date format.
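For instance, here is a rough sketch of that idea for a single store and day (the 09:00-21:00 opening hours are made up; the conversion back to strings happens only once, at the end):
import numpy as np
import pandas as pd

# epoch seconds at midnight of the day being generated
day_start = int(pd.Timestamp('2019-01-01').value // 10**9)
open_s = day_start + 9 * 3600        # hypothetical 09:00 opening time
close_s = day_start + 21 * 3600      # hypothetical 21:00 closing time
slots = np.arange(open_s, close_s, 15 * 60)   # 15-minute steps as plain integers
# convert back to readable timestamps only at the very end
timestamps = pd.to_datetime(slots, unit='s').strftime('%Y-%m-%d %H:%M')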
Another thing you should change in your script is all the nearly identical, repeated code. It won't improve performance, but it does improve readability, debugging, and your skills as a programmer. As a simple example, I have refactored some of the code (you can probably do better than what I did, but this is just an example).
from datetime import timedelta, date, datetime
import numpy as np
import cProfile
import pandas as pd


def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta


# inputs
empty_data = pd.DataFrame(columns=['store', 'timestamp'])
start_dt = date(2019, 1, 1)
days = 365
data = "input data | attached to the post"  # placeholder for the attached input data

for i in range(days):
    for j in range(len(data.store)):
        curr_date = start_dt + timedelta(days=i)
        curr_date_year = curr_date.year
        curr_date_month = curr_date.month
        curr_date_day = curr_date.day
        weekno = curr_date.weekday()

        week_range = 'sun'
        if weekno < 5:
            week_range = 'm_f'
        elif weekno == 5:
            week_range = 'sat'

        first_time = datetime(curr_date_year, curr_date_month, curr_date_day, data[week_range + '_open_hrs'].iloc[j], data[week_range + '_open_min'].iloc[j])
        second_time = datetime(curr_date_year, curr_date_month, curr_date_day, data[week_range + '_close_hrs'].iloc[j], data[week_range + '_close_min'].iloc[j])
        dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in datetime_range(first_time, second_time, timedelta(minutes=15))]
        vert = pd.DataFrame(dts, columns=['timestamp'])
        vert['store'] = data['store'].iloc[j]
        empty_data = pd.concat([vert, empty_data])

final_data = empty_data
Good luck!

Related

Experiencing odd output from django.utils.timezone.now(), datetime.datetime.now() and pytz.timezone

I'm experiencing strange behavior when attempting to convert between UTC and specific timezones. I'd love for someone to explain why I'm seeing this behavior and what the more "correct" way of getting timezone information might be.
Code:
import pytz
import datetime
from django.utils import timezone
print(timezone.now())
print(pytz.utc.localize(datetime.datetime.now()))
print('\n')
def get_local_and_utc_date_ranges(days=1500, days_ago=2, local_timezone="America/Asuncion"):
    seller_timezone = pytz.timezone(local_timezone)
    utc_timezone = pytz.utc
    seller_today = timezone.now().astimezone(seller_timezone)
    seller_days_ago = seller_today - timezone.timedelta(days=days_ago)
    local_date_end = seller_days_ago.replace(hour=23, minute=59, second=59, microsecond=999999)
    local_date_start = (local_date_end - timezone.timedelta(days=days)).replace(hour=0, minute=0, second=0, microsecond=0)
    utc_date_end = local_date_end.astimezone(utc_timezone)
    utc_date_start = local_date_start.astimezone(utc_timezone)
    date_ranges = {
        "local_date_end": local_date_end,
        "local_date_start": local_date_start,
        "utc_date_end": utc_date_end,
        "utc_date_start": utc_date_start,
    }
    return date_ranges


def get_utc_and_local_date_ranges(days=1500, days_ago=2, local_timezone='America/Asuncion'):
    seller_timezone = pytz.timezone(local_timezone)
    utc_timezone = pytz.utc
    utc_today = datetime.datetime.utcnow()
    utc_days_ago = utc_today - datetime.timedelta(days=days_ago)
    local_date_end = seller_timezone.localize(utc_days_ago).replace(
        hour=23, minute=59, second=59, microsecond=999999
    )
    local_date_start = (local_date_end - datetime.timedelta(days=days)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    utc_date_end = local_date_end.astimezone(utc_timezone)
    utc_date_start = local_date_start.astimezone(utc_timezone)
    date_ranges = {
        'local_date_end': local_date_end,
        'local_date_start': local_date_start,
        'utc_date_end': utc_date_end,
        'utc_date_start': utc_date_start,
    }
    return date_ranges
days = 1500
days_ago = 2
dates = get_local_and_utc_date_ranges(days=days, days_ago=days_ago)
dates2 = get_utc_and_local_date_ranges(days=days, days_ago=days_ago)
print('dates1:')
print('local_date_start:', dates['local_date_start'])
print('local_date_end:', dates['local_date_end'])
print('utc_date_start:', dates['utc_date_start'])
print('utc_date_end:', dates['utc_date_end'])
print('\n')
print('dates2:')
print('local_date_start:', dates2['local_date_start'])
print('local_date_end:', dates2['local_date_end'])
print('utc_date_start:', dates2['utc_date_start'])
print('utc_date_end:', dates2['utc_date_end'])
print('\n')
Output:
2019-03-25 18:57:55.929908+00:00
2019-03-25 18:57:55.930005+00:00
dates1:
local_date_start: 2015-02-12 00:00:00-04:00
local_date_end: 2019-03-23 23:59:59.999999-04:00
utc_date_start: 2015-02-12 04:00:00+00:00
utc_date_end: 2019-03-24 03:59:59.999999+00:00
dates2:
local_date_start: 2015-02-12 00:00:00-03:00
local_date_end: 2019-03-23 23:59:59.999999-03:00
utc_date_start: 2015-02-12 03:00:00+00:00
utc_date_end: 2019-03-24 02:59:59.999999+00:00
Note the inconsistent UTC offset (that particular timezone switched to DST on Mar 23rd). But when I try to replicate the issue using the following code:
import pytz
import datetime
from django.utils import timezone
now1 = timezone.now() - datetime.timedelta(days=2)
now2 = pytz.utc.localize(datetime.datetime.now()) - datetime.timedelta(days=2)
seller_timezone = pytz.timezone('America/Asuncion')
print(now1.astimezone(seller_timezone).replace(
    hour=23, minute=59, second=59, microsecond=999999
))
print(now2.astimezone(seller_timezone).replace(
    hour=23, minute=59, second=59, microsecond=999999
))
The output is correct:
2019-03-23 23:59:59.999999-03:00
2019-03-23 23:59:59.999999-03:00
I'm hoping someone can explain why this behavior is happening and how I might avoid the inconsistency.
Your get_local_and_utc_date_ranges() function is producing incorrect results because it's doing datetime arithmetic (i.e. subtracting a timedelta) with a localized time, which doesn't work.
seller_today = timezone.now().astimezone(seller_timezone)
seller_days_ago = seller_today - timezone.timedelta(days=days_ago)
This is noted in the datetime module documentation:
As for addition, the result [of subtracting a timedelta] has the same tzinfo attribute as the input datetime, and no time zone adjustments are done even if the input is aware.
This is also noted in the pytz documentation:
If you perform date arithmetic on local times that cross DST boundaries, the result may be in an incorrect timezone.
pytz offers a fix:
A normalize() method is provided to correct this.
So you could use:
seller_days_ago = seller_timezone.normalize(seller_today - timezone.timedelta(days=days_ago))
...
local_date_start = seller_timezone.normalize(local_date_end - timezone.timedelta(days=days)).replace(hour=0, minute=0, second=0, microsecond=0)
However, the documentation also notes that:
The preferred way of dealing with times is to always work in UTC.
So a better solution would be to only do arithmetic in UTC:
utc_today = datetime.datetime.utcnow()
utc_date_end = utc_today - datetime.timedelta(days=days_ago)
utc_date_start = utc_date_end - datetime.timedelta(days=days)
local_date_end = seller_timezone.localize(utc_date_end).replace(hour=23, minute=59, second=59, microsecond=999999)
local_date_start = seller_timezone.localize(utc_date_start).replace(hour=0, minute=0, second=0, microsecond=0)
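For illustration, here is a small standalone sketch of that offset re-evaluation (the dates are arbitrary; the exact offsets depend on the tz database):
import pytz
from datetime import datetime, timedelta

tz = pytz.timezone('America/Asuncion')
start = tz.localize(datetime(2019, 3, 25, 12, 0))
shifted = start - timedelta(days=2)   # keeps the original tzinfo; the offset may now be stale
corrected = tz.normalize(shifted)     # re-evaluates the offset for the new local time
print(shifted.utcoffset(), corrected.utcoffset())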

How to query to fetch last 5 months records?

I have a model named 'DemoModel' that has a field called demo_date.
I want to fetch the last 5 months of records (i.e. from the current month back to 5 months ago) by querying on the demo_date field.
My model looks like:
class DemoModel(models.Model):
    demo_date = models.DateTimeField()
from datetime import datetime, timedelta

today = datetime.today()
long_ago = today - timedelta(days=150)  # roughly 5 months, assuming 30-day months
retrieved_data = DemoModel.objects.filter(demo_date__gte=long_ago)
Use
from dateutil.relativedelta import relativedelta
to calculate the five_months_ago parameter accurately, since it handles varying month lengths.
And then get the objects like this:
target_set = DemoModel.objects.filter(demo_date__gte=five_months_ago)
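Putting it together, a minimal sketch (model and field names as in the question; timezone.now() is just one reasonable choice of reference time):
from dateutil.relativedelta import relativedelta
from django.utils import timezone

five_months_ago = timezone.now() - relativedelta(months=5)
target_set = DemoModel.objects.filter(demo_date__gte=five_months_ago)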
This function subtracts or adds months:
def monthdelta(date, delta):
    m, y = (date.month + delta) % 12, date.year + (date.month + delta - 1) // 12
    if not m:
        m = 12
    # clamp the day to the length of the target month (February handled via the leap-year rule)
    d = min(date.day, [31,
        29 if y % 4 == 0 and (y % 100 != 0 or y % 400 == 0) else 28,
        31, 30, 31, 30, 31, 31, 30, 31, 30, 31][m - 1])
    return date.replace(day=d, month=m, year=y)
The query goes here:
from datetime import datetime

query = DemoModel.objects.filter(demo_date__gte=monthdelta(datetime.now(), -5))

django query aggregate function is slow?

I am working with Django to see how to handle large databases. I use a database with fields name, age, date of birth (dob), and height. The table has about 500000 entries. I have to find the average height of persons (1) of the same age and (2) born in the same year. The aggregate query takes about 10 s. Is that usual, or am I missing something?
For age:
age = [i[0] for i in Data.objects.values_list('age').distinct()]
ht = []
for each in age:
    aggr = Data.objects.filter(age=each).aggregate(ag_ht=Avg('height'))
    ht.append(aggr)
From dob:
age = [i[0].year for i in Data.objects.values_list('dob').distinct()]
for each in age:
    aggr = Data.objects.filter(dob__contains=each).aggregate(ag_ht=Avg('height'))
    ht.append(aggr)
The year has to be extracted from dob. It is SQLite and I cannot use __year (join).
For these queries to be efficient, you have to create indexes on the age and dob columns.
You will get a small additional speedup by using covering indexes, i.e., using two-column indexes that also include the height column.
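For example, assuming the model is defined in Django (the field types below are my guesses), the indexes could be declared on the model's Meta (Django 1.11+):
from django.db import models

class Data(models.Model):
    name = models.CharField(max_length=100)  # assumed field type
    age = models.IntegerField()
    dob = models.DateField()                 # assumed field type
    height = models.FloatField()             # assumed field type

    class Meta:
        indexes = [
            # two-column (covering) indexes: filter/group column first, height included
            models.Index(fields=['age', 'height'], name='data_age_height_idx'),
            models.Index(fields=['dob', 'height'], name='data_dob_height_idx'),
        ]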
Here is the full version, with a timing comparison between the loop version and the queryset version.
import time

from dd.models import Data
from django.db.models import Avg
from django.db.models.functions import ExtractYear

# for age
start = time.time()
age = [i[0] for i in Data.objects.values_list('age').distinct()]
ht = []
for each in age:
    aggr = Data.objects.filter(age=each).aggregate(ag_ht=Avg('height'))
    ht.append(aggr)
end = time.time()
loop_time = end - start

start = time.time()
qs = Data.objects.values('age').annotate(ag_ht=Avg('height')).order_by('age')
ht_qs = qs.values_list('age', 'ag_ht')
end = time.time()
qs_time = end - start

print loop_time / qs_time

# for dob year, with an easy refactoring of your version (build a set of the years)
start = time.time()
years = set([i[0].year for i in Data.objects.values_list('dob').distinct()])
ht_year_loop = []
for each in years:
    aggr = Data.objects.filter(dob__contains=each).aggregate(ag_ht=Avg('height'))
    ht_year_loop.append((each, aggr.get('ag_ht')))
end = time.time()
loop_time = end - start

start = time.time()
qs = Data.objects.annotate(dob_year=ExtractYear('dob')).values('dob_year').annotate(ag_ht=Avg('height'))
ht_qs = qs.values_list('dob_year', 'ag_ht')
end = time.time()
qs_time = end - start

print loop_time / qs_time

Difference between two dates python/Django

I need to know how to get the time elapsed between edit_date (a column from one of my models) and datetime.now(). My edit_date column is a DateTimeField. (I'm using Python 2.7 and Django 1.10.)
This is the function I'm trying to do:
def time_in_status(request):
    for item in Reporteots.objects.exclude(edit_date__exact=None):
        date_format = "%Y-%m-%d %H:%M:%S"
        a = datetime.now()
        b = item.edit_date
        c = a - b
        dif = divmod(c.days * 86400 + c.seconds, 60)
        days = str(dif)
        print days
The only thing I'm getting from this function are the minutes elapsed and the seconds. What I need is the elapsed time in the following format:
Time_elapsed = 3d 47m 23s
Any ideas? Let me know if I'm not clear or if you need more information.
Thanks for your attention,
Take a look at dateutil.relativedelta:
http://dateutil.readthedocs.io/en/stable/relativedelta.html
from dateutil.relativedelta import relativedelta
from datetime import datetime

now = datetime.now()
ago = datetime(2017, 2, 11, 13, 5, 22)
diff = relativedelta(now, ago)  # later date first, so the components come out positive
print "%dd %dm %ds" % (diff.days, diff.minutes, diff.seconds)
I did that code from memory, so you may have to tweak it to your needs.
Try something like
c = a - b
minutes = (c.seconds % 3600) // 60
seconds = c.seconds % 60
print "%sd %sm %ss" % (c.days, minutes, seconds)

Filling Value of a Pandas Data Frame From a Large DB Query (Python)

I am running a snippet of code that queries a database and then fills in a pandas DataFrame with a value of 1 if that tuple is present in the query. It does this by running the query, then iterating over the tuples and filling in the DataFrame. However, the query returns almost 8 million rows of data.
Does anyone know how to speed up a process like this? Here is the code:
user_age = pd.read_sql_query(sql_age, datastore, index_col=['userid']).age.astype(np.int, copy=False)
x = pd.DataFrame(0, index=user_age.index, columns=range(366), dtype=np.int8)
for r in pd.read_sql_query(sql_active, datastore, chunksize=50000):
    for userid, day in r.itertuples(index=False):
        x.at[userid, day] = 1
Thank you in advance!
You could save some time by replacing the Python loop
for userid, day in r.itertuples(index=False):
    x.at[userid, day] = 1
with a NumPy array assignment using "advanced integer indexing":
x[npidx[r['userid']], r['day']] = 1
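Here npidx is a lookup array that maps a userid to its row position in x. A tiny sketch of the idea (assuming, as in the demo below, that the userids form a shuffled 0..N-1 range):
import numpy as np

index = np.array([2, 0, 1])           # userids, in the order they appear as rows of x
npidx = np.empty_like(index)
npidx[index] = np.arange(len(index))  # npidx[2] -> 0, npidx[0] -> 1, npidx[1] -> 2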
On an 80000-row DataFrame, using_numpy (below) is about 6x faster:
In [7]: %timeit orig()
1 loop, best of 3: 984 ms per loop
In [8]: %timeit using_numpy()
10 loops, best of 3: 162 ms per loop
import numpy as np
import pandas as pd

def mock_read_sql_query():
    np.random.seed(2016)
    for arr in np.array_split(index, N//M):
        size = len(arr)
        df = pd.DataFrame({'userid': arr, 'day': np.random.randint(366, size=size)})
        df = df[['userid', 'day']]
        yield df

N, M = 8*10**4, 5*10**2
index = np.arange(N)
np.random.shuffle(index)
columns = range(366)

def using_numpy():
    npidx = np.empty_like(index)
    npidx[index] = np.arange(len(index))
    x = np.zeros((len(index), len(columns)), dtype=np.int8)
    for r in mock_read_sql_query():
        x[npidx[r['userid']], r['day']] = 1
    x = pd.DataFrame(x, columns=columns, index=index)
    return x

def orig():
    x = pd.DataFrame(0, index=index, columns=columns, dtype=np.int8)
    for r in mock_read_sql_query():
        for userid, day in r.itertuples(index=False):
            x.at[userid, day] = 1
    return x

expected = orig()
result = using_numpy()
expected_index, expected_col = np.where(expected)
result_index, result_col = np.where(result)
assert np.equal(expected_index, result_index).all()
assert np.equal(expected_col, result_col).all()