I have a model that looks something like this:
class Payment(TimeStampModel):
    timestamp = models.DateTimeField(auto_now_add=True)
    amount = models.FloatField()
    creator = models.ForeignKey(to='Payer')
What is the correct way to calculate average spending per day?
I can aggregate by day, but then the days on which a payer spends nothing won't be counted, which is not correct.
UPDATE:
So, let's say I have only two records in my db, one from March 1 and one from January 1. The average spending per day should be something like
(Sum of all spendings) / (March 1 - January 1)
that is, the sum divided by about 60.
However, this of course gives me just the average spending per item, and the number of days comes out as 2:
for p in Payment.objects.all():
    print(p.timestamp, p.amount)
p = Payment.objects.all().dates('timestamp','day').aggregate(Sum('amount'), Avg('amount'))
print(p)
Output:
2019-03-05 17:33:06.490560+00:00 456.0
2019-01-05 17:33:06.476395+00:00 123.0
{'amount__sum': 579.0, 'amount__avg': 289.5}
You can aggregate min and max timestamp and the sum of amount:
from django.db.models import Min, Max, Sum

def average_spending_per_day():
    aggregate = Payment.objects.aggregate(Min('timestamp'), Max('timestamp'), Sum('amount'))
    min_datetime = aggregate.get('timestamp__min')
    if min_datetime is not None:
        min_date = min_datetime.date()
        max_date = aggregate.get('timestamp__max').date()
        total_amount = aggregate.get('amount__sum')
        # +1 so that both the first and the last day are counted
        days = (max_date - min_date).days + 1
        return total_amount / days
    return 0
If min_datetime is not None, the table contains at least one row, so the max date and the total amount are available as well. Otherwise we return 0, or whatever default you prefer.
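To check the arithmetic without a database, the same logic can be sketched in plain Python. This is only an illustration: the helper name average_per_day and the in-memory list of (timestamp, amount) pairs are assumptions, and the sample values are the two records from the question.

```python
from datetime import datetime


def average_per_day(payments):
    """payments: iterable of (timestamp, amount) pairs (hypothetical stand-in for the table)."""
    payments = list(payments)
    if not payments:
        return 0
    first_day = min(ts for ts, _ in payments).date()
    last_day = max(ts for ts, _ in payments).date()
    total = sum(amount for _, amount in payments)
    # +1 so that both the first and the last day are counted
    days = (last_day - first_day).days + 1
    return total / days


sample = [
    (datetime(2019, 3, 5, 17, 33, 6), 456.0),
    (datetime(2019, 1, 5, 17, 33, 6), 123.0),
]
print(average_per_day(sample))  # 579.0 spread over 60 days
```

January 5 to March 5, 2019 is 59 days apart, so with both end days counted the divisor is 60.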
It depends on your backend, but you want to divide the sum of amount by the difference in days between your max and min timestamp. In Postgres, you can simply subtract two dates to get the number of days between them. In MySQL there is a function called DATEDIFF that takes two dates and returns the number of days between them.
from django.db.models import (
    DecimalField, ExpressionWrapper, Func, IntegerField, Max, Min, Sum, Value,
)

class Date(Func):
    function = 'DATE'

class MySQLDateDiff(Func):
    function = 'DATEDIFF'

    def __init__(self, *expressions, **extra):
        # Compare calendar dates rather than full timestamps
        expressions = [Date(exp) for exp in expressions]
        extra['output_field'] = extra.get('output_field', IntegerField())
        super().__init__(*expressions, **extra)

class PgDateDiff(Func):
    # Renders as "DATE(a) - DATE(b)"
    template = "%(expressions)s"
    arg_joiner = ' - '

    def __init__(self, *expressions, **extra):
        expressions = [Date(exp) for exp in expressions]
        extra['output_field'] = extra.get('output_field', IntegerField())
        super().__init__(*expressions, **extra)

agg = {
    'avg_spend': ExpressionWrapper(
        Sum('amount') / (PgDateDiff(Max('timestamp'), Min('timestamp')) + Value(1)),
        output_field=DecimalField())
}
avg_spend = Payment.objects.aggregate(**agg)
That looks roughly right to me, but I haven't tested it. Use MySQLDateDiff instead if that's your backend.
I have written a script that generates a dataset of 15-minute time intervals, based on the operating hours provided for each store for every day of the week, over 365 days.
Example: let us say Store 1 opens at 9 AM and closes at 9 PM on all days. That is 12 hours every day. 12 * 4 = 48 (15-minute periods a day). 48 * 365 = 17520 (15-minute periods for a year).
The sample dataset only contains 5 sites but there are about 9000 sites that this script needs to generate data for.
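As a rough size estimate, the arithmetic above can be extended to all sites. This sketch assumes, purely for illustration, that every one of the 9000 sites follows the same 12-hour schedule as Store 1; the real total depends on each site's hours.

```python
HOURS_OPEN_PER_DAY = 12   # 9 AM to 9 PM, as in the Store 1 example
PERIODS_PER_HOUR = 4      # 15-minute periods
DAYS = 365
SITES = 9000              # assumption: all sites share the same schedule

periods_per_day = HOURS_OPEN_PER_DAY * PERIODS_PER_HOUR
periods_per_year = periods_per_day * DAYS
total_rows = periods_per_year * SITES

print(periods_per_day, periods_per_year, total_rows)  # 48 17520 157680000
```

Under that assumption the final dataset has on the order of 150 million rows, which is worth keeping in mind when choosing how to build and store it.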
The script runs fine for a handful of sites (100) and a couple of days (2), but it needs to run for 9000 sites and 365 days.
I am looking for suggestions to make this run faster. It will be running on a local machine.
input data: https://drive.google.com/open?id=1uLYRUsJ2vM-TIGPvt5RhHDhTq3vr4V2y
output data: https://drive.google.com/open?id=13MZCQXfVDLBLFbbmmVagIJtm6LFDOk_T
Please let me know if I can help with anything more to get this answered.
def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

import pandas as pd
import numpy as np
import cProfile
from datetime import timedelta, date, datetime

#inputs
empty_data = pd.DataFrame(columns=['store','timestamp'])
start_dt = date(2019, 1, 1)
days = 365
data = "input data | attached to the post"

for i in range(days):
    for j in range(len(data.store)):
        curr_date = start_dt + timedelta(days=i)
        curr_date_year = curr_date.year
        curr_date_month = curr_date.month
        curr_date_day = curr_date.day
        weekno = curr_date.weekday()
        if weekno<5:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year,curr_date_month,curr_date_day,data['m_f_open_hrs'].iloc[j],data['m_f_open_min'].iloc[j]), datetime(curr_date_year,curr_date_month,curr_date_day, data['m_f_close_hrs'].iloc[j],data['m_f_close_min'].iloc[j]),
                   timedelta(minutes=15))]
            vert = pd.DataFrame(dts,columns = ['timestamp'])
            vert['store']= data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])
        elif weekno==5:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year,curr_date_month,curr_date_day,data['sat_open_hrs'].iloc[j],data['sat_open_min'].iloc[j]), datetime(curr_date_year,curr_date_month,curr_date_day, data['sat_close_hrs'].iloc[j],data['sat_close_min'].iloc[j]),
                   timedelta(minutes=15))]
            vert = pd.DataFrame(dts,columns = ['timestamp'])
            vert['store']= data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])
        else:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year,curr_date_month,curr_date_day,data['sun_open_hrs'].iloc[j],data['sun_open_min'].iloc[j]), datetime(curr_date_year,curr_date_month,curr_date_day, data['sun_close_hrs'].iloc[j],data['sun_close_min'].iloc[j]),
                   timedelta(minutes=15))]
            vert = pd.DataFrame(dts,columns = ['timestamp'])
            vert['store']= data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])

final_data = empty_data
I think the most time consuming tasks in your script are the datetime calculations.
You should try to do all of those calculations in UNIX time. It represents time as an integer count of seconds, so the difference between two UNIX timestamps is a simple subtraction.
In my opinion you should perform all the operations that way, and only once the process has finished convert the results to a more readable date format.
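A minimal sketch of that idea, using only the standard library, might look like the following. The helper name slot_epochs, the use of UTC, and the single hypothetical store open from 09:00 to 21:00 are assumptions made for illustration; they are not taken from the original script or its input file.

```python
from datetime import datetime, timezone

SLOT_SECONDS = 15 * 60


def slot_epochs(day_start_epoch, open_minutes, close_minutes):
    """Epoch seconds of every 15-minute slot in [open, close) for one day."""
    start = day_start_epoch + open_minutes * 60
    end = day_start_epoch + close_minutes * 60
    return range(start, end, SLOT_SECONDS)


# Midnight of 2019-01-01, expressed as integer seconds (UTC assumed for simplicity)
day_start = int(datetime(2019, 1, 1, tzinfo=timezone.utc).timestamp())

# Hypothetical store 1, open 09:00 to 21:00
epochs = slot_epochs(day_start, 9 * 60, 21 * 60)

# Convert to readable strings only once, at the very end
rows = [
    (1, datetime.fromtimestamp(e, tz=timezone.utc).strftime('%Y-%m-%d %H:%M'))
    for e in epochs
]
print(len(rows), rows[0], rows[-1])
```

The rows are collected in a plain list here, so a DataFrame would only need to be built once from the finished list rather than grown inside the loop.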
Another thing you should change in your script is the repetition of almost identical code. Removing it won't improve performance, but it improves readability and debugging, as well as your skills as a programmer. As a simple example I have refactored some of the code (you can probably do better than what I did, but this is just an example).
def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

from datetime import timedelta, date, datetime
import numpy as np
import cProfile
import pandas as pd

# inputs
empty_data = pd.DataFrame(columns=['store', 'timestamp'])
start_dt = date(2019, 1, 1)
days = 365
data = "input data | attached to the post"

for i in range(days):
    for j in range(len(data.store)):
        curr_date = start_dt + timedelta(days=i)
        curr_date_year = curr_date.year
        curr_date_month = curr_date.month
        curr_date_day = curr_date.day
        weekno = curr_date.weekday()
        # Pick the column prefix for this weekday: Mon-Fri, Saturday or Sunday
        week_range = 'sun'
        if weekno < 5:
            week_range = 'm_f'
        elif weekno == 5:
            week_range = 'sat'
        first_time = datetime(curr_date_year,curr_date_month,curr_date_day,data[week_range + '_open_hrs'].iloc[j],data[week_range + '_open_min'].iloc[j])
        second_time = datetime(curr_date_year,curr_date_month,curr_date_day, data[week_range + '_close_hrs'].iloc[j],data[week_range + '_close_min'].iloc[j])
        dts = [ dt.strftime('%Y-%m-%d %H:%M') for dt in datetime_range(first_time, second_time, timedelta(minutes=15)) ]
        vert = pd.DataFrame(dts, columns = ['timestamp'])
        vert['store']= data['store'].iloc[j]
        empty_data = pd.concat([vert, empty_data])

final_data = empty_data
Good luck!
I'm trying to use PVLIB to estimate output power for a PV System installed in the west of my country.
As an example I've got 2 days of hourly GHI, 2m Temperature and 10m wind speed from MERRA2 reanalysis.
I want to estimate how much power a fixed PV system or a single-axis tracking system would generate using the aforementioned dataset and the ModelChain class from PVLIB. I first estimate DNI from the GHI data using the DISC model, and then take DHI as the difference between GHI and DNI*cos(Z).
a) The first behaviour I am not sure about: here is the plot of GHI, DNI, DHI, T2m and wind speed. It seems that DNI is shifted, with its maximum occurring 1 hour before the GHI maximum.
Weather Figure
After preparing the irradiance data I calculated AC using ModelChain, specifying the fixed PV system and the single-axis tracking system.
The thing is that I don't trust the AC output for the single-axis tracking system. I expected a plateau-shaped AC output and instead found some odd behaviour.
Here are the power generation output values I expected to see:
Expectation
And here is the estimated output by PVLIB
Reality
I hope someone can help me find the error in my procedure.
Here is the code:
# =============================================================================
# Example of using MERRA2 data and PVLIB
# =============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pvlib
from pvlib.pvsystem import PVSystem
from pvlib.location import Location
from pvlib.modelchain import ModelChain
# =============================================================================
# 1) Create small data set extracted from MERRA
# =============================================================================
GHI = np.array([0,0,0,0,0,0,0,0,0,10.8,148.8,361,583,791.5,998.5,1105.5,1146.5,1118.5,1023.5,
860.2,650.2,377.1,165.1,16,0,0,0,0,0,0,0,0,0,11.3,166.2,395.8,624.5,827,986,
1065.5,1079,1025.5,941.5,777,581.5,378.9,156.2,20.6,0,0,0,0])
temp_air = np.array([21.5,20.5,19.7,19.6,18.8,17.9,17.1,16.5,16.2,16.2,17,21.3,24.7,26.9,28.8,30.5,
31.6,32.4,33,33.3,32.9,32,30.6,28.7,25.4,23.9,22.6,21.2,20.3,19.9,19.5,19.1,18.4,
17.7,18.3,23,25.1,27.3,29.5,31.2,32.1,32.6,32.6,32.5,31.8,30.7,29.6,28.1,24.6,22.9,
22.3,23.2])
wind_speed = np.array([3.1,2.7,2.5,2.6,2.8,3,3,3,2.8,2.5,2.1,1,2.2,3.7,4.8,5.6,6.1,6.4,6.5,6.6,6.3,5.8,5.3,
3.7,3.9,4,3.6,3.4,3.4,3,2.6,2.3,2.1,2,2.2,2.7,3.2,4.3,5.1,5.6,5.7,5.8,5.8,5.7,5.4,4.8,
4.4,3.1,2.7,2.3,1.1,0.6])
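# pd.date_range builds the hourly index (the start/end/freq arguments of pd.DatetimeIndex were removed in pandas 1.0)
local_timestamp = pd.date_range(start='1979-12-31 21:00', end='1980-01-03 00:00', freq='1h',tz='America/Argentina/Buenos_Aires')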
d = {'ghi':GHI,'temp_air':temp_air,'wind_speed':wind_speed}
data = pd.DataFrame(data=d)
data.index = local_timestamp
lat = -31.983
lon = -68.530
location = Location(latitude = lat,
longitude = lon,
tz = 'America/Argentina/Buenos_Aires',
altitude = 601)
# =============================================================================
# 2) SOLAR POSITION AND ATMOSPHERIC MODELING
# =============================================================================
solpos = pvlib.solarposition.get_solarposition(time = local_timestamp,
latitude = lat,
longitude = lon,
altitude = 601)
# DNI and DHI calculation from GHI data
DNI = pvlib.irradiance.disc(ghi = data.ghi,
solar_zenith = solpos.zenith,
datetime_or_doy = local_timestamp)
DHI = data.ghi - DNI.dni*np.cos(np.radians(solpos.zenith.values))
d = {'ghi': data.ghi,'dni': DNI.dni,'dhi': DHI,'temp_air':data.temp_air,'wind_speed':data.wind_speed }
weather = pd.DataFrame(data=d)
plt.plot(weather)
# =============================================================================
# 3) SYSTEM SPECIFICATIONS
# =============================================================================
# load some module and inverter specifications
sandia_modules = pvlib.pvsystem.retrieve_sam('SandiaMod')
cec_inverters = pvlib.pvsystem.retrieve_sam('cecinverter')
sandia_module = sandia_modules['Canadian_Solar_CS5P_220M___2009_']
cec_inverter = cec_inverters['Power_Electronics__FS2400CU15__645V__645V__CEC_2018_']
# Fixed system with tilt=abs(lat)-10
f_system = PVSystem( surface_tilt = abs(lat)-10,
surface_azimuth = 0,
module = sandia_module,
inverter = cec_inverter,
module_parameters = sandia_module,
inverter_parameters = cec_inverter,
albedo = 0.20,
modules_per_string = 100,
strings_per_inverter = 100)
# 1 axis tracking system
t_system = pvlib.tracking.SingleAxisTracker(axis_tilt = 0, #abs(-33.5)-10
axis_azimuth = 0,
max_angle = 52,
backtrack = True,
module = sandia_module,
inverter = cec_inverter,
module_parameters = sandia_module,
inverter_parameters = cec_inverter,
name = 'tracking',
gcr = .3,
modules_per_string = 100,
strings_per_inverter = 100)
# =============================================================================
# 4) MODEL CHAIN USING ALL THE SPECIFICATIONS for a fixed and 1 axis tracking systems
# =============================================================================
mc_f = ModelChain(f_system, location)
mc_t = ModelChain(t_system, location)
# Next, we run a model with some simple weather data.
mc_f.run_model(times=weather.index, weather=weather)
mc_t.run_model(times=weather.index, weather=weather)
# =============================================================================
# 5) Get only AC output from a fixed and 1 axis tracking systems and assign
# 0 values to each NaN
# =============================================================================
d = {'fixed':mc_f.ac,'tracking':mc_t.ac}
AC = pd.DataFrame(data=d)
# Replace NaN values in both columns with 0 (avoids chained assignment)
AC = AC.fillna(0)
plt.plot(AC)
I hope someone can help me with the interpretation of the results and the debugging of the code.
Thanks a lot!
I suspect your issue is due to the way the hourly data is treated. Be sure that you're consistent with the interval labeling (beginning/end) and treatment of instantaneous vs. average data. One likely cause is using hourly average GHI data to derive DNI data. pvlib.solarposition.get_solarposition returns the solar position at the instants in time that are passed to it. So you're mixing up hourly average GHI values with instantaneous solar position values when you use pvlib.irradiance.disc to calculate DNI and when you calculate DHI. Shifting your time index by 30 minutes will reduce, but not eliminate, the error. Another approach is to resample the input data to be of 1-5 minute resolution.
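The 30-minute shift mentioned above can be sketched with pandas alone. This assumes the hourly labels mark the start of each averaging interval, which needs to be checked against the MERRA2 product documentation; if the labels mark the end of the interval, the shift should be subtracted instead. The pvlib calls are left as comments because they mirror the question's code and are not executed here.

```python
import pandas as pd

# Hourly labels as in the question (assumed to mark the START of each averaging interval)
times = pd.date_range(start='1979-12-31 21:00', end='1980-01-03 00:00',
                      freq='1h', tz='America/Argentina/Buenos_Aires')

# Evaluate the solar position at the interval midpoints instead of at the labels
midpoints = times + pd.Timedelta(minutes=30)

# solpos = pvlib.solarposition.get_solarposition(midpoints, lat, lon, altitude=601)
# solpos.index = times  # realign with the hourly data before calling disc()

print(times[0], midpoints[0])
```

As noted above, this reduces the mismatch between hourly averages and instantaneous sun positions but does not remove it.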
I have one model which looks like this:
class Measurement(models.Model):
    date = models.DateField('date')
    time = models.TimeField('time')
    Q = models.DecimalField(max_digits=10, decimal_places=6)
    P = models.DecimalField(max_digits=10, decimal_places=6)
    f = models.DecimalField(max_digits=10, decimal_places=6)
In my views, I would like to represent it. So I made this function:
from django.http import HttpResponse
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

def plotMeas(request):
    # Count the events
    c = Measurement.objects.all()
    c = c.count()
    # Variables
    i = 0
    a = [0]
    P = a*c
    Q = a*c
    t = a*c
    # Save dP_L1 & dQ_L1 in lists
    for i in range(c):
        meas = Measurement.objects.get(pk = i+1)
        P [i] = meas.P
        Q [i] = meas.Q
        t [c-1-i] = i*10
    # Keep only the latest 100 measurements
    if c > 100:
        P = P[-100:]
        Q = Q[-100:]
        t = t[-100:]
    # Construct the graph
    fig = Figure()
    q = fig.add_subplot(211)
    q.set_xlabel("time (minutes ago)")
    q.set_ylabel("Q (VAR)")
    p = fig.add_subplot(212)
    p.set_xlabel("time (minutes ago)")
    p.set_ylabel("P (W)")
    p.plot(t,P, 'go-')
    q.plot(t,Q, 'o-')
    canvas = FigureCanvas(fig)
    response = HttpResponse(content_type='image/png')
    canvas.print_png(response)
    return response
However, I would like the horizontal axis to show the date and the time (saved in the model). Does anyone know how to do it?
Have a look at the documentation for plot_date. Conveniently plot_date takes similar arguments to plot. A call might look like:
p.plot_date(sequence_of_datetime_objects, y_axis_values, 'go-')
Using matplotlib.dates you can then customize the format of your x-axis labels.
A simple example:
The following will specify that the x-axis displays only every third month in the format Jan '09 (assuming English-speaking locale).
import matplotlib.dates as mdates
p.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
p.xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
Since you have dates and times stored separately you may either want to
change your model to use a DateTimeField, or
use Python to combine them.
For example:
import datetime as dt
t1 = dt.time(21,0,1,2) # 21:00:01.000002 (the fourth argument is microseconds)
d1 = dt.date.today()
dt1 = dt.datetime.combine(d1,t1)
# result: datetime.datetime(2011, 4, 15, 21, 0, 1, 2)
To iterate over two sequences and combine them you might use zip (code for illustrative purposes only, not necessarily optimized):
sequence_of_datetime_objects = []
for a_date, a_time in zip(sequence_of_date_objects, sequence_of_time_objects):
    sequence_of_datetime_objects.append(dt.datetime.combine(a_date, a_time))
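The same combination can also be written as a list comprehension. The two short input lists below are made-up sample values standing in for the date and time columns of the model.

```python
import datetime as dt

# Hypothetical sample values standing in for Measurement.date and Measurement.time
sequence_of_date_objects = [dt.date(2011, 4, 15), dt.date(2011, 4, 16)]
sequence_of_time_objects = [dt.time(21, 0), dt.time(9, 30)]

sequence_of_datetime_objects = [
    dt.datetime.combine(a_date, a_time)
    for a_date, a_time in zip(sequence_of_date_objects, sequence_of_time_objects)
]
print(sequence_of_datetime_objects[0].isoformat())  # 2011-04-15T21:00:00
```

The resulting list can be passed directly as the first argument of plot_date.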
Feel free to open another question if you get stuck implementing the specifics.