Define time period in database for analyses - django

I want to store timeseries in a database. The values of these timeseries are usually defined for a certain period, for example:
country population in 2014, 2015, 2016, etc.
number of houses in country in 2014, 2015, 2016
I want to combine the data of these variables to be able to do some statistics, e.g. housing vs. population. This is only possible if I make sure the time periods are exactly the same. The periods are usually on a per year/quarter/month basis. How can I best store these values such that I can later compare them?
I currently use start_date (datetime) and end_date (datetime), which obviously works but needs a good GUI to prevent that, for example, one person enters:
start = 1-1-2016 & end = 31-12-2016
while another would enter:
start = 1-1-2016 & end = 1-1-2017
I think it would be a good idea to leave the freedom of defining the period with the user, but to help them define it consistently. How would you suggest doing this?
BTW: I work with Django so my current model has the following two fields:
period_start = models.DateField(null=False)
period_end = models.DateField(null=False)
Edit 8-5-2018 10:32: added some information on storing data
Some extra information for added clarity:
I store my data in two tables: (1) the variable definition and (2) the values.
The variable definition looks roughly like this:
class VarDef(models.Model):
    name = models.CharField(max_length=2000, null=False)
    unit = models.CharField(max_length=20, null=False)
    desc = models.CharField(max_length=2000, blank=True, null=True)

class VarValue(models.Model):
    value = models.DecimalField(max_digits=60, decimal_places=20, null=False)
    var = models.ForeignKey(VarDef, on_delete=models.CASCADE, null=False,
                            related_name='var_values')
    period_start = models.DateField(null=False)
    period_end = models.DateField(null=False)

It is hard to answer since I don't have your full code (models, views, etc.). But keep in mind that you can query Django date fields using __lt and __gt like this:
import datetime
# user input from your view, I hardcoded just for the sake of the example
start_date = datetime.date(2005, 1, 1)
end_date = datetime.date(2005, 3, 31)
house_data = HousesData.objects.filter(period_start__gt=start_date, period_end__lt=end_date).all()
country_data = CountryData.objects.filter(period_start__gt=start_date, period_end__lt=end_date).all()
# Do the rest of your calculation
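Beyond querying, one way to nudge users toward consistent period boundaries is to let them pick a granularity and derive both dates from it. This is only a sketch of that idea, not the asker's code: the period_type field and the normalisation in clean() are my own assumptions.

import calendar
from django.core.exceptions import ValidationError
from django.db import models

class VarValue(models.Model):
    PERIOD_CHOICES = [('year', 'Year'), ('quarter', 'Quarter'), ('month', 'Month')]

    value = models.DecimalField(max_digits=60, decimal_places=20)
    var = models.ForeignKey('VarDef', on_delete=models.CASCADE, related_name='var_values')
    period_type = models.CharField(max_length=10, choices=PERIOD_CHOICES, default='year')  # assumed field
    period_start = models.DateField()
    period_end = models.DateField()

    def clean(self):
        # Normalise the period so that, e.g., "2016" is always stored as
        # 2016-01-01 .. 2016-12-31, whatever the user typed.
        if self.period_type == 'year':
            self.period_start = self.period_start.replace(month=1, day=1)
            self.period_end = self.period_start.replace(month=12, day=31)
        elif self.period_type == 'quarter':
            first_month = 3 * ((self.period_start.month - 1) // 3) + 1
            self.period_start = self.period_start.replace(month=first_month, day=1)
            last_month = first_month + 2
            last_day = calendar.monthrange(self.period_start.year, last_month)[1]
            self.period_end = self.period_start.replace(month=last_month, day=last_day)
        elif self.period_type == 'month':
            last_day = calendar.monthrange(self.period_start.year, self.period_start.month)[1]
            self.period_start = self.period_start.replace(day=1)
            self.period_end = self.period_start.replace(day=last_day)
        if self.period_end < self.period_start:
            raise ValidationError('period_end must not be before period_start')

With that normalisation in place, two variables recorded for the same year/quarter/month share identical period_start and period_end values and can be joined on those two columns for the housing-vs-population comparison.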

Related

How could you make this really, really complicated raw SQL query with Django's ORM?

Good day, everyone. Hope you're doing well. I'm a Django newbie, trying to learn the basics of RESTful development while helping with a small app project. Currently, there's a really difficult query that I must do to create a calculated field that updates my students' status according to the time interval the classes are in. First, let me explain the models:
class StudentReport(models.Model):
    student = models.ForeignKey(Student, on_delete=models.CASCADE)
    headroom_teacher = models.ForeignKey(Teacher, on_delete=models.CASCADE)
    upload = models.ForeignKey(Upload, on_delete=models.CASCADE, related_name='reports', blank=True, null=True)
    exams_date = models.DateTimeField(null=True, blank=True)
    # Other fields that don't matter

class ExamCycle(models.Model):
    student = models.ForeignKey(Student, on_delete=models.CASCADE)
    headroom_teacher = models.ForeignKey(Teacher, on_delete=models.CASCADE)
    # Other fields that don't matter

class RecommendedClasses(models.Model):
    report = models.ForeignKey(StudentReport, on_delete=models.CASCADE)
    range_start = models.DateField(null=True)
    range_end = models.DateField(null=True)
    # Other fields that don't matter

class StudentStatus(models.TextChoices):
    enrolled = 'enrolled'  # started class
    anxious_for_exams = 'anxious_for_exams'
    sticked_with_it = 'sticked_with_it'  # already passed one cycle
So this app will help with the management of a cram school. We first make an initial report of the student and their best/worst subjects in StudentReport. Then a RecommendedClasses object is created that tells them which classes they should enroll in. Finally, we have a cycle of exams (let's say 4 times a year). After they complete each exam, another report is created and they can be recommended a new class or move on to the next level of their previous class.
I'll use the choices in StudentStatus to calculate an annotated field that I will call status on my RecommendedClasses model. I'm having issues with the sticked_with_it status because it is a query that is done after one cycle is completed and two reports have been made (two, because this query must be done in StudentStatus after the second report is created). A 'sticked_with_it' student has a report created after the exams_date of the report the RecommendedClasses was created from, and that future exams_date falls between 30 days before range_start and 60 days after range_end of the recommendation (don't question this, it's just the way the higher-ups want the status).
I have already come up with two ways to do it, but one uses a raw SQL query and the other is way too complicated and slow. Here is the raw SQL:
SELECT rec.id AS rec_id FROM
school_recommendedclasses rec LEFT JOIN
school_report original_report
ON rec.report_id = original_report.id
AND rec.teacher_id = original_report.teacher_id
JOIN reports_report2 future_report
ON future_report.exams_date > original_report.exams_date
AND future_report.student_id = original_report.student_id
AND future_report.`exams_date` > (rec.`range_start` - INTERVAL 30 DAY)
AND future_report.`exams_date` <
(rec.`range_end` + INTERVAL 60 DAY)
AND original_report.student_id = future_report.student_id
How can I translate this into proper Django ORM code that is not so painfully unoptimized? I'll show you the other way in the comments.
FWIW, I find this easier to read, but there's very little wrong with your query.
Transforming this to your ORM should be straightforward, and any further optimisations are down to indexes...
SELECT r.id rec_id
FROM reports_recommendation r
JOIN reports_report2 o
ON o.id = r.report_id
AND o.provider_id = r.provider_id
JOIN reports_report2 f
ON f.initial_exam_date > o.initial_exam_date
AND f.patient_id = o.patient_id
AND f.initial_exam_date > r.range_start - INTERVAL 30 DAY
AND f.initial_exam_date < r.range_end + INTERVAL 60 DAY
AND f.provider_id = o.provider_id
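A hedged sketch of that ORM translation, using the StudentReport/RecommendedClasses models from the question: the Exists subquery, the annotated window bounds and their names (recs, later_report, window_start, window_end, has_followup) are my own, and the date arithmetic via ExpressionWrapper is backend-dependent.

from datetime import timedelta
from django.db.models import DateField, Exists, ExpressionWrapper, F, OuterRef

# Pre-compute the allowed window on each recommendation
recs = RecommendedClasses.objects.annotate(
    window_start=ExpressionWrapper(F('range_start') - timedelta(days=30),
                                   output_field=DateField()),
    window_end=ExpressionWrapper(F('range_end') + timedelta(days=60),
                                 output_field=DateField()),
)

# A later report by the same student and teacher whose exams_date falls inside the window
later_report = (
    StudentReport.objects
    .filter(student=OuterRef('report__student'),
            headroom_teacher=OuterRef('report__headroom_teacher'))
    .filter(exams_date__gt=OuterRef('report__exams_date'))
    .filter(exams_date__gt=OuterRef('window_start'))
    .filter(exams_date__lt=OuterRef('window_end'))
)

sticked_with_it = recs.annotate(has_followup=Exists(later_report)).filter(has_followup=True)

The chained filter() calls just keep each condition on its own line; whether comparing the DateTimeField exams_date against the DateField bounds needs an explicit cast depends on the database.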

Django annotation on compound-ish primary key with filter ignoring primary key, resulting in too many annotated items

Please see EDIT1 below, as well.
Using Django 3.0.6 and Python 3.8, given the following models:
class Plants(models.Model):
    plantid = models.TextField(primary_key=True, unique=True)

class Pollutions(models.Model):
    pollutionsid = models.IntegerField(unique=True, primary_key=True)
    year = models.IntegerField()
    plantid = models.ForeignKey(Plants, models.DO_NOTHING, db_column='plantid')
    pollutant = models.TextField()
    releasesto = models.TextField(blank=True, null=True)
    amount = models.FloatField(db_column="amount", blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'pollutions'
        unique_together = (('plantid', 'releasesto', 'pollutant', 'year'))

class Monthp(models.Model):
    monthpid = models.IntegerField(unique=True, primary_key=True)
    year = models.IntegerField()
    month = models.IntegerField()
    plantid = models.ForeignKey(Plants, models.DO_NOTHING, db_column='plantid')
    power = models.IntegerField(null=False)

    class Meta:
        managed = False
        db_table = 'monthp'
        unique_together = ('plantid', 'year', 'month')
I'd like to annotate onto each plant, based on a foreign key relationship and a filter on a value, the amount of CO2 and the sum of its power for a given year. For the sake of debugging I have replaced Sum with Count in the following query:
annotated = tmp.all().annotate(
    energy=Count('monthp__power', filter=Q(monthp__year=YEAR)),
    co2=Count('pollutions__amount', filter=Q(pollutions__year=YEAR,
                                             pollutions__pollutant="CO2",
                                             pollutions__releasesto="Air")))
However, this returns too many items (and, correspondingly, a wrong number when using Sum):
annotated.first().co2 # 60, but it should be 1
annotated.first().energy # 252, but it should be 1
although my database guarantees, as denoted above, that (plantid, year, month) and (plantid, releasesto, pollutant, year) are unique together, which can easily be demonstrated:
pl = annotated.first().plantid
testplant = Plants.objects.get(pk=pl) # plant object
pco2 = Pollutions.objects.filter(plantid=testplant, year=YEAR, pollutant="CO2", releasesto="Air")
len(pco2) # 1, as expected
Why does Django return too many results, and how can I tell Django to limit the elements to annotate to the 'current' primary key, in other words to only annotate the elements where the foreign key matches the primary key?
I can achieve what I intend to do by using distinct and Max:
energy=Sum('yearly__power', distinct=True, filter=Q(yearly__year=YEAR)),
co2=Max('pollutions__amount', ...
However, the performance is unacceptable.
I have tried using model_to_dict and appending the wanted values "by hand" to the dict, which works for the values themselves but not for sorting the resulting dict (e.g. by energy), and it is actually faster than the workaround directly above.
It conceptually strikes me that the manual approach is faster than letting the database do what it is intended to do.
Is this a limitation of Django's ORM, or am I missing something?
EDIT1:
The behaviour has been a known bug for 11 years.
Even others "spent a whole day on this".
I am now trying it with subqueries. However, the foreign key I am using is not the primary key of its table, so the "usual" approach of using pk='' does not work. More concretely, trying:
tmp = Plants.objects.filter(somefilter)
subq1 = Subquery(Yearly.objects.filter(pk=OuterRef('plantid'), year=YEAR))
tmp1 = tmp.all().annotate(
    energy=Count(Subquery(subq1))
)
returns
OperationalError at /xyz
no such column: U0.yid
which definitely makes sense, because Plants has no clue what a yid is; it only knows plantids. How do I adjust the subquery to handle that?
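A common pattern for this, sketched here under my own assumptions (I use the Monthp model defined above in place of Yearly, and YEAR as in the question), is to correlate the subquery on the plant foreign key instead of pk and to aggregate inside it so that exactly one row comes back per plant:

from django.db.models import IntegerField, OuterRef, Subquery, Sum

power_sq = (
    Monthp.objects
    .filter(plantid=OuterRef('pk'), year=YEAR)
    .values('plantid')                      # group by the correlated plant
    .annotate(total=Sum('power'))
    .values('total')[:1]
)

co2_sq = (
    Pollutions.objects
    .filter(plantid=OuterRef('pk'), year=YEAR,
            pollutant="CO2", releasesto="Air")
    .values('amount')[:1]                   # unique_together guarantees a single row
)

annotated = tmp.annotate(
    energy=Subquery(power_sq, output_field=IntegerField()),
    co2=Subquery(co2_sq),
)

Since plantid is the primary key of Plants, OuterRef('pk') correlates the inner filter with the outer plant row, which avoids the inflated counts caused by the duplicated joins in the original annotate.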

Django filter model with timestamp

I have the following models:
class User(models.Model):
    id = models.CharField(max_length=10, primary_key=True)

class Data(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    timestamp = models.IntegerField(default=0)
Given a single user, I would like to know how I can filter using timestamp. For example:
Obtain the data from user1, between now and 1 hour ago.
I have the current timestamp from now = time.time(), and 1 hour ago as hour_ago = now - 3600.
I would like to obtain the Data that has a timestamp between these two values.
Use range to obtain data between two values.
You can use range anywhere you can use BETWEEN in SQL — for dates, numbers and even characters.
e.g.
Data.objects.filter(timestamp__range=(start, end))
from docs:
import datetime
start_date = datetime.date(2005, 1, 1)
end_date = datetime.date(2005, 3, 31)
Entry.objects.filter(pub_date__range=(start_date, end_date))
You can use __gte, which means greater than or equal to, and __lte, which means less than or equal to. So try this:
Data.objects.filter(timestamp__gte=hour_ago, timestamp__lte=now)
You can find similar examples in the official docs.
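Putting the two answers together with the epoch-seconds values from the question (a small sketch; the 'user1' lookup value is a placeholder):

import time

now = int(time.time())
hour_ago = now - 3600

# All Data rows for the given user whose integer timestamp falls in the last hour
recent = Data.objects.filter(user_id='user1', timestamp__range=(hour_ago, now))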

Filtering data by date range

I have my payment model and I want to be able to select payments by date:
class LeasePayment(CommonInfo):
    version = IntegerVersionField()
    amount = models.DecimalField(max_digits=7, decimal_places=2)
    lease = models.ForeignKey(Lease)
    leaseterm = models.ForeignKey(LeaseTerm)
    payment_date = models.DateTimeField()
    method = models.CharField(max_length=2, default='Ch',
                              choices=PAYMENT_METHOD_CHOICES)
Basically, I want to be able to input 2 dates and display all the data between them. Right now I have started to implement this solution https://groups.google.com/forum/#!topic/django-filter/lbi_B4zYq4M based on django_filter. However, since the task is pretty trivial, I was wondering if there is an easier way.
Try using a __range lookup on the date field; it will return the data from the database within the selected date interval:
LeasePayment.objects.filter(payment_date__range=[start_date, end_date])
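One caveat worth adding: payment_date is a DateTimeField, so passing plain dates makes the upper bound midnight at the start of end_date, and later payments on that day are dropped. A sketch of two ways around that, assuming start_date and end_date are date objects (the __date lookup needs a reasonably recent Django version):

from datetime import timedelta

# Compare only the date part, so the whole end day is included
LeasePayment.objects.filter(payment_date__date__range=(start_date, end_date))

# Or use an exclusive upper bound one day past the end date
LeasePayment.objects.filter(
    payment_date__gte=start_date,
    payment_date__lt=end_date + timedelta(days=1),
)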

Slow iteration over django queryset

I am iterating over a Django queryset that contains anywhere from 500-1000 objects. The corresponding model/table has 7 fields in it as well. The problem is that it takes about 3 seconds to iterate over, which seems way too long considering all the other data processing that needs to be done in my application.
EDIT:
Here is my model:
class Node(models.Model):
    node_id = models.CharField(null=True, blank=True, max_length=30)
    jobs = models.TextField(null=True, blank=True)
    available_mem = models.CharField(null=True, blank=True, max_length=30)
    assigned_mem = models.CharField(null=True, blank=True, max_length=30)
    available_ncpus = models.PositiveIntegerField(null=True, blank=True)
    assigned_ncpus = models.PositiveIntegerField(null=True, blank=True)
    cluster = models.CharField(null=True, blank=True, max_length=30)
    datetime = models.DateTimeField(auto_now_add=False)
This is my initial query, which is very fast:
timestamp = models.Node.objects.order_by('-pk').filter(cluster=cluster)[0]
self.nodes = models.Node.objects.filter(datetime=timestamp.datetime)
But then, when I iterate, it takes 3 seconds. I've tried two ways, as seen below:
def jobs_by_node(self):
    """returns a dictionary containing keys that
    are strings of node ids and values that
    are lists of the jobs running on that node."""
    jobs_by_node = {}
    # iterate over nodes and populate jobs_by_node dictionary
    tstart = time.time()
    for node in self.nodes:
        pass  # I have omitted the code because the slowdown is simply iteration
    tend = time.time()
    tfinal = tend - tstart
    return jobs_by_node
Other method:
all_nodes = self.nodes.values('node_id')
tstart = time.time()
for node in all_nodes:
    pass
tend = time.time()
tfinal = tend - tstart
I tried the second method by referring to this post, but it still has not sped up my iteration one bit. I've scoured the web to no avail. Any help optimizing this process will be greatly appreciated. Thank you.
Note: I'm using Django version 1.5 and Python 2.7.3
Check the issued SQL query. You can use print statement:
print self.nodes.query # in general: print queryset.query
That should give you something like:
SELECT id, jobs, ... FROM app_node
Then run EXPLAIN SELECT id, jobs, ... FROM app_node and you'll know what exactly is wrong.
Assuming that you know what the problem is after running EXPLAIN, and that simple solutions like adding indexes aren't enough, you can think about e.g. fetching the relevant rows into a separate table every X minutes (in a cron job or Celery task) and using that separate table in your application.
If you are using PostgreSQL you can also use materialized views and "wrap" them in an unmanaged Django model.
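A minimal sketch of that last suggestion; the view name, its columns and the SQL are assumptions, not something from the question:

from django.db import models

# Created once outside the ORM, e.g. in a RunSQL migration:
#   CREATE MATERIALIZED VIEW node_summary AS
#       SELECT id, node_id, jobs, cluster, datetime FROM app_node;
# and refreshed periodically (cron/Celery) with:
#   REFRESH MATERIALIZED VIEW node_summary;

class NodeSummary(models.Model):
    node_id = models.CharField(max_length=30, null=True, blank=True)
    jobs = models.TextField(null=True, blank=True)
    cluster = models.CharField(max_length=30, null=True, blank=True)
    datetime = models.DateTimeField()

    class Meta:
        managed = False          # Django never creates or alters the view
        db_table = 'node_summary'

Queries against NodeSummary then read the precomputed rows instead of hitting app_node directly.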