Derived variables in Django Models - django

I am building a Django app that has a central Projects model:
class Project(models.Model):
fundamental_attrib1=models.IntegerField()
fundamental_attrib2=models.DecimalField(max_digits=10, decimal_places=2)
derived_attrib1=models.DecimalField(null=True, max_digits=10, decimal_places=2)
derived_attrib1_start=models.DateField(null=True)
derived_attrib1_end=models.DateField(null=True)
derived_attrib2=models.DecimalField(null=True, max_digits=10, decimal_places=2)
derived_attrib2_start=models.DateField(null=True)
derived_attrib2_end=models.DateField(null=True)
derived_attrib3=models.IntegerField(null=True)
derived_attrib3_start=models.DateField(null=True)
derived_attrib3_end=models.DateField(null=True)
The goal is to allow users to instantiate new projects where they can only see (and only need to) the 'fundamental' variables in the form to create/update a Project. Once they have submitted the form, I want to be calculate all the optional parameters before saving the project to the database.
In addition, most of my derived variables come in groups of three as above (value, start date, end date). Is there a better way (that makes sense) to store them in the database? Naive string example:
{'derived_attrib1':[1000,date(2017,1,1),date(2017,2,2)]}
{'derived_attrib2':[2000,date(2017,2,1),date(2017,3,2)]}
{'derived_attrib3':[ 500,date(2017,3,1),date(2017,4,2)]}
My eventual 'end goal' is, for each group:
create numpy arrays (or bring into one DataFrame?) to interpolate days between the start and end dates
linearly distribute the value across the days
plot each group as a timeseries (probably with D3.js/Bokeh or similar)

If these come in groups and can be seen as an entity, consider using ArrayField. For other backends, one could use a json representation in a TextField (or one of the json fields that work in the same way). Field choice depends on what comes in from the calculation and what your processing is most comfy with.
The choice not to do this, would be if you frequently would filter querysets on attributes of these entities, as while that's possible, it's not as fast as querying straight fields. Relations will be impossible.
A totally different approach is to use OneToOne fields to models. This creates a ton of joins for your approach, so I'm not recommending it, but it has some advantages in terms of handling each derived entity: it's fields and calculation method are independent of the model that uses them.

I would say ForeignKey in a related model would be the best bet. Then you can query Project.objects.filter(derived__start__gt=start, derived__end__lt=something). prefetch_related would only require two queries to get any amount of data from these two tables. This allows the number of properties to be infinite and you can query them any way you want.
class Derived(models.Model):
project = models.ForeignKey(Project, related_name='derived')
value=models.DecimalField(null=True, max_digits=10, decimal_places=2)
start=models.DateField(null=True)
end=models.DateField(null=True)

Related

GET Django products and their thumbnails from foreignKey with single query

I have these two Django models:
class Product(models.Model):
name=models.CharField(max_length=300)
description=models.CharField(max_length=10000)
class Thumnbnail(models.Model):
thumnbnail=models.ImageField(null=True)
product=models.ForeignKey(Product, related_name='related_product', on_delete=models.CASCADE)
User inputs some keyword, and based on that, I filter the product names, and matching product names are shown, together with all their product's thumbnails.
What is the most efficient way to retrieve all the products, and all their thumbnails, without querying twice separately for the products and for the thumbnails?
I know that using .select_related() we can fetch the foreign objects as well, but we can make either
a) two separate serializers for each model separately, thus requiring to have two separate viewsets, two querysets and two accesses to the database, or
b) nested Serializer, with fields from both models, but in that case we will get repeating data, the product_description will appear in every row for every thumbnail, which is also a problem, since that description can be many characters long, and will be an overkill to repeat it. Omitting it from the fields is not an option either, as I do need to get it at least once somehow.
Is there a third, more efficient way? I expect this to be possible with accessing the database only once, but I can't figure out a way to do it.

Should I use JSONField over ForeignKey to store data?

I'm facing a dilemma, I'm creating a new product and I would not like to mess up the way I organise the informations in my database.
I have these two choices for my models, the first one would be to use foreign keys to link my them together.
Class Page(models.Model):
data = JsonField()
Class Image(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
Class Video(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
etc...
The second is to keep everything in Page's JSONField:
Class Page(models.Model):
data = JsonField() # videos and pictures, etc... are stored here
Is one better than the other and why? This would be a huge help on the way I would organize my databases in the futur.
I thought maybe the second option could be slower since everytime something changes all the json would be overridden, but does it make a huge difference or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.

Django model for sparse data

I am developing a django app that contains a number of forms which will be used to enter clinical data on some cancer tissue samples (10-20 fields per form, mostly CharField, FloatField and some multiple choice text dropdowns).
My challenge is that I need a form that can display different fields based on a diagnosis, for 150+ diagnoses. I can programmatically read the list of diagnoses, the fields required for each diagnosis and corresponding field types. Also, the set of all unique fields across all diagnoses is large (much larger than the number of fields needed for any specific diagnosis).
e.g.
disease_specific_fields field_type
diagnosis
B-lymphoblastic leukemia/lymphoma NOS EBV-positive Pull down: Yes/No
B-lymphoblastic leukemia/lymphoma with recurrent genetic abnormalities(TCF3-PBX1) EBV-positive Pull down: Yes/No
Monoclonal B lymphocytosis(CLL/SLL spectrum) EBV-positive Pull down: Yes/No
Peripheral T cell lymphoma NOS EBV-positive Pull down: Yes/No
AML with recurrent cytogenetic abnormalities(t(6;9) DEK-NUP214) EBV-positive Pull down: Yes/No
So far, I thought of the following approaches:
Create a single huge model that will contain mostly sparse data, and handle irrelevant data using django forms. CONS: inefficient storage and a lot of overhead code tied to forms.
Create a model for each diagnosis. CONS: complicates migrations and maintenance, I think.
Create one small model for all diagnoses that contains several 'generic' fields of each type ('CharField', 'FloatField', etc), and render respective field names dynamically in forms / views.
I am looking for any constructive suggestions on how to implement a model/models capturing the above data. Efficiency and storage are secondary concerns, mostly I want a clean and intuitive solution. Any answers tailored for django will be especially helpful.
A few options I'd consider-
Use Django-Polymorphic to create inheritance-based model types
Django-Polymorphic allows you to use inheritance for differentiating between types of models.
from polymorphic.models import PolymorphicModel
class Animal(PolymorphicModel):
kingdom = models.CharField(default="Animalia")
class Lizard(Animal):
class = models.CharField(default="Reptilia")
class Iguana(Lizard):
favorite_tree = models.Charfield()
While polymorphic uses a single db table for any model in an inheritance scheme, types are stored. As such, if you know the specific fields you want to capture hard-code it. Plus, you can filter by level (So, you could run a query on all Animal instances or all Iguana instances in the example above). There's no relations created by a polymorphic model, so performance is extremely good.
Use Django-Mutant if dynamic field creation is needed
Django-Mutant allows for dynamic creation of fields per model, allowing you top define data as needed on the fly. However, intermediary tables are required to do this. You gain a lot of flexibility while losing performance.
Use the postgres-specific JsonField to store data
Django 1.9 introduced native support for field type JsonField, allowing you to write Json structures to a db field as well as query them relatively quickly. You get amazing flexibility with decent performance but may struggle in providing user friendly forms to create, update, and verify the data. However, it has been done in many projects and there are libraries out there to assist with it.
from django.contrib.postgres.fields import JSONField
from django.db import models
class SomeModel(models.Model):
attributes = JsonField()
>>> some_attributes = {'color':'red', 'cell_count':150, 'enzymes':['xyzyss','xyxzxxyx']}
>>> a = SomeModel.objects.create(attributes=some_attributes)
>>> SomeModel.objects.filter(attributes__color='red')
(<<< will return a queryset with instance 'a' in it >>>)

Do Django models really need a single unique key field

Some of my models are only unique in a combination of keys. I don't want to use an auto-numbering id as the identifier as subsets of the data will be exported to other systems (such as spreadsheets), modified and then used to update the master database.
Here's an example:
class Statement(models.Model):
supplier = models.ForeignKey(Supplier)
total = models.DecimalField("statement total", max_digits=10, decimal_places=2)
statement_date = models.DateField("statement date")
....
class Invoice(models.Model):
supplier = models.ForeignKey(Supplier)
amount = models.DecimalField("invoice total", max_digits=10, decimal_places=2)
invoice_date = models.DateField("date of invoice")
statement = models.ForeignKey(Statement, blank=True, null=True)
....
Invoice records are only unique for a combination of supplier, amount and invoice_date
I'm wondering if I should create a slug for Invoice based on supplier, amount and invoice_date so that it is easy to identify the correct record.
An example of the problem of having multiple related fields to identify the right record is django-csvimport which assumes there is only one related field and will not discriminate on two when building the foreign key links.
Yet the slug seems a clumsy option and needs some kind of management to rebuild the slugs after adding records in bulk.
I'm thinking this must be a common problem and maybe there's a best practice design pattern out there somewhere.
I am using PostgreSQL in case anyone has a database solution. Although I'd prefer to avoid that if possible, I can see that it might be the way to build my slug if that's the way to go, perhaps with trigger functions. That just feels a bit like hidden functionality though, and may cause a headache for setting up on a different server.
UPDATE - after reading initial replies
My application requires that data may be exported, modified remotely, and merged back into the master database after review and approval. Hidden autonumber keys don't easily survive that consistently. The relation invoices[2417] is part of statements[265] is not persistent if the statement table was emptied and reloaded from a CSV.
If I use the numeric autonumber pk then any process that is updating the database would need to refresh the related key numbers or by using the multiple WITH clause.
If I create a slug that is based on my 3 keys but easy to reproduce then I can use it as the key - albeit clumsily. I'm thinking of a slug along the lines:
u'%s %s %s' % (self.supplier,
self.statement_date.strftime("%Y-%m-%d"),
self.total)
This seems quite clumsy and not very DRY as I expect I may have to recreate the slug elsewhere duplicating the algorithm (maybe in an Excel formula, or an Access query)
I thought there must be a better way I'm missing but it looks like yuvi's reply means there should be, and there will be, but not yet :-(
What you're talking about it a multi-column primary key, otherwise known as "composite" or "compound" keys. Support in django for composite keys today is still in the works, you can read about it here:
Currently Django models only support a single column in this set,
denying many designs where the natural primary key of a table is
multiple columns [...] Current state is that the issue is
accepted/assigned and being worked on [...]
The link also mentions a partial implementation which is django-compositekeys. It's only partial and will cause you trouble with navigating between relationships:
support for composite keys is missing in ForeignKey and
RelatedManager. As a consequence, it isn't possible to navigate
relationships from models that have a composite primary key.
So currently it isn't entirely supported, but will be in the future. Regarding your own project, you can make of that what you will, though my own suggestion is to stick with the fully supported default of a hidden auto-incremented field that you don't even need to think about (and use unique_together to enforce the uniqness of the described fields instead of making them your primary keys).
I hope this helps!
No.
Model needs to have one field that is primary_key = True. By default this is the (hidden) autofield which stores object Id. But you can set primary_key to True at any other field.
I've done this in cases, Where i'm creating django project upon tables which were previously created manually or through some other frameworks/systems.
In reality - you can use whatever means you can think of, for joining objects together in queries. As long as query returns bunch of data that can be associated with models you have - it does not really matter which field you are using for joins. Just keep in mind, that the solution you use should be as effective as possible.
Alan

Are there performance advantages by splitting a Django model/table into two models/tables?

In SO question 7531153, I asked the proper way to split a Django model into two—either using Django's Multi-table Inheritance or explicitly defining a OneToOneField.
Based Luke Sneeringer's comment, I'm curious if there's a performance gain from splitting the model in two.
The reason I was thinking about splitting the model in two is because I have some fields that will always be completed, while there are other fields that will typically be empty (until the project is closed).
Are there performance gains from putting typically empty fields, such as actual_completion_date and actual_project_costs, into a separate model/table in Django?
Split into Two Models
class Project(models.Model):
project_number = models.SlugField(max_length=5, blank=False,
primary_key=True)
budgeted_costs = models.DecimalField(max_digits=10, decimal_places=2)
submitted_on = models.DateField(auto_now_add=True)
class ProjectExtendedInformation(models.Model):
project = models.OneToOneField(CapExProject, primary_key=True)
actual_completion_date = models.DateField(blank=True, null=True)
actual_project_costs = models.DecimalField(max_digits=10, decimal_places=2,
blank=True, null=True)
Actually, quite the opposite. Any time multiple tables are involved, a SQL JOIN will be required, which is inherently slower for a database to perform than a simple SELECT query. The fact that the fields are empty is meaningless in terms of performance one way or another.
Depending on the size of the table and the number of columns, it may be faster to only select a subset of fields that you need to interact with, but that's easy enough in Django with the only method:
Project.objects.only('project_number', 'budgeted_costs', 'submitted_on')
Which produces something akin to:
SELECT ('project_number', 'budgeted_costs', 'submitted_on') FROM yourapp_project;
Using separate models (and tables) only makes sense for the purposes of modularization -- such that you subclass Project to create a specific kind of project that requires additional fields but still needs all the fields of a generic Project.
For your case, if there's some info that's available only when it's closed, I'd indeed advise making a separate model.
Joins aren't bad. Especially in your case the join will be faster if you have all rows in one table and much fewer rows in the other one. I've worked with databases a lot, and in most cases it's a pure guess to tell if a join will be better or worse. Even a full table scan is better than using an index in many cases. You need to look at the EXPLAINs, if performance is a concern, and profile the Db work if possible (I know Oracle supports this.) But before performance becomes an issue, I prefer quicker development.
We have a table in Django with 5M rows. And we needed a column that would have been not null only for 1K rows. Just altering the table would have taken half a day. Rebuilding from scratch also takes a few hours. We've chosen to make a separate model.
I've been to a lecture on Domain Driven Design in which the author explained that it is important, especially in development of a new app, to separate models, to not stuff everything in one class.
Let's say you have a CargoAircraft class and PassengerAircraft. It's so tempting to put them in one class and work "seamlessly", isn't it? But interactions with them (scheduling, booking, weight or capacity calculations) are completely different.
So, by putting everything in one class you force yourself to bunch of IF clauses in every method, to extra methods in Manager, to harder debugging, to bigger tables in the DB. Basically you make yourself spend more time developing for the sake of what? For only two things: 1) fewer joins 2) fewer class names.
If you separate the classes, things go much easier:
clean code, no ugly ifs, no .getattr and defaults
easy debugging
more mainainable database
hence, faster development.