Should I use JSONField over ForeignKey to store data? - django

I'm facing a dilemma, I'm creating a new product and I would not like to mess up the way I organise the informations in my database.
I have these two choices for my models, the first one would be to use foreign keys to link my them together.
Class Page(models.Model):
data = JsonField()
Class Image(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
Class Video(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
etc...
The second is to keep everything in Page's JSONField:
Class Page(models.Model):
data = JsonField() # videos and pictures, etc... are stored here
Is one better than the other and why? This would be a huge help on the way I would organize my databases in the futur.
I thought maybe the second option could be slower since everytime something changes all the json would be overridden, but does it make a huge difference or is what I am saying false?

A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.

Related

Why is the database used by Django subqueries not sticky?

I have a concern with django subqueries using the django ORM. When we fetch a queryset or perform a DB operation, I have the option of bypassing all assumptions that django might make for the database that needs to be used by forcing usage of the specific database that I want.
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
The above disregards any database routers I might have set and goes straight to 'some_db'.
But if my models approximately look like so :-
class Author(models.Model):
author_name=models.CharField(max_length=255)
author_address=models.CharField(max_length=255)
class Book(models.Model):
book_name=models.CharField(max_length=255)
author=models.ForeignKey(Author, null = True)
And I fetch a QuerySet representing all books that are called Mark like so:-
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
Then later if somewhere in the code I trigger a subquery by doing something like:-
if b_det:
auth_address = b_det[0].author.author_address
Then this does not make use of the original database 'some_db' that I had specified early on for the main query. This again goes through the routers and picks up (possibly) the incorrect database.
Why does django do this. IMHO , if I had selected forced usage of database for the original query then even for the subquery the same database needs to be used. Why must the database routers come into picture for this at all?
This is not a subquery in the strict SQL sense of the word. What you are actually doing here is to execute one query and use the result of that to find related items.
You can chain filters and do lots of other operations on a queryset but it will not be executed until you take a slice on it or call .values() but here you are actually taking a slice
auth_address = b_det[0].#rest of code
So you have a materialized query and you are now trying to find the address of the related author and that requires another query but you are not using with so django is free to choose which database to use. You cacn overcome this by using select_related

Django model for sparse data

I am developing a django app that contains a number of forms which will be used to enter clinical data on some cancer tissue samples (10-20 fields per form, mostly CharField, FloatField and some multiple choice text dropdowns).
My challenge is that I need a form that can display different fields based on a diagnosis, for 150+ diagnoses. I can programmatically read the list of diagnoses, the fields required for each diagnosis and corresponding field types. Also, the set of all unique fields across all diagnoses is large (much larger than the number of fields needed for any specific diagnosis).
e.g.
disease_specific_fields field_type
diagnosis
B-lymphoblastic leukemia/lymphoma NOS EBV-positive Pull down: Yes/No
B-lymphoblastic leukemia/lymphoma with recurrent genetic abnormalities(TCF3-PBX1) EBV-positive Pull down: Yes/No
Monoclonal B lymphocytosis(CLL/SLL spectrum) EBV-positive Pull down: Yes/No
Peripheral T cell lymphoma NOS EBV-positive Pull down: Yes/No
AML with recurrent cytogenetic abnormalities(t(6;9) DEK-NUP214) EBV-positive Pull down: Yes/No
So far, I thought of the following approaches:
Create a single huge model that will contain mostly sparse data, and handle irrelevant data using django forms. CONS: inefficient storage and a lot of overhead code tied to forms.
Create a model for each diagnosis. CONS: complicates migrations and maintenance, I think.
Create one small model for all diagnoses that contains several 'generic' fields of each type ('CharField', 'FloatField', etc), and render respective field names dynamically in forms / views.
I am looking for any constructive suggestions on how to implement a model/models capturing the above data. Efficiency and storage are secondary concerns, mostly I want a clean and intuitive solution. Any answers tailored for django will be especially helpful.
A few options I'd consider-
Use Django-Polymorphic to create inheritance-based model types
Django-Polymorphic allows you to use inheritance for differentiating between types of models.
from polymorphic.models import PolymorphicModel
class Animal(PolymorphicModel):
kingdom = models.CharField(default="Animalia")
class Lizard(Animal):
class = models.CharField(default="Reptilia")
class Iguana(Lizard):
favorite_tree = models.Charfield()
While polymorphic uses a single db table for any model in an inheritance scheme, types are stored. As such, if you know the specific fields you want to capture hard-code it. Plus, you can filter by level (So, you could run a query on all Animal instances or all Iguana instances in the example above). There's no relations created by a polymorphic model, so performance is extremely good.
Use Django-Mutant if dynamic field creation is needed
Django-Mutant allows for dynamic creation of fields per model, allowing you top define data as needed on the fly. However, intermediary tables are required to do this. You gain a lot of flexibility while losing performance.
Use the postgres-specific JsonField to store data
Django 1.9 introduced native support for field type JsonField, allowing you to write Json structures to a db field as well as query them relatively quickly. You get amazing flexibility with decent performance but may struggle in providing user friendly forms to create, update, and verify the data. However, it has been done in many projects and there are libraries out there to assist with it.
from django.contrib.postgres.fields import JSONField
from django.db import models
class SomeModel(models.Model):
attributes = JsonField()
>>> some_attributes = {'color':'red', 'cell_count':150, 'enzymes':['xyzyss','xyxzxxyx']}
>>> a = SomeModel.objects.create(attributes=some_attributes)
>>> SomeModel.objects.filter(attributes__color='red')
(<<< will return a queryset with instance 'a' in it >>>)

How to retrieve values from Django ForeignKey -> ManyToMany fields?

I have a model (Realtor) with a ForeignKey field (BillingTier), which has a ManyToManyField (BillingPlan). For each logged in realtor, I want to check if they have a billing plan that offers automatic feedback on their listings. Here's what the models look like, briefly:
class Realtor(models.Model):
user = models.OneToOneField(User)
billing_tier = models.ForeignKey(BillingTier, blank=True, null=True, default=None)
class BillingTier(models.Model):
plans = models.ManyToManyField(BillingPlan)
class BillingPlan(models.Model):
automatic_feedback = models.BooleanField(default=False)
I have a permissions helper that checks the user permissions on each page load, and denies access to certain pages. I want to deny the feedback page if they don't have the automatic feedback feature in their billing plan. However, I'm not really sure the best way to get this information. Here's what I've researched and found so far, but it seems inefficient to be querying on each page load:
def isPermitted(user, url):
premium = [t[0] for t in user.realtor.billing_tier.plans.values_list('automatic_feedback') if t[0]]
I saw some solutions which involved using filter (ManyToMany field values from queryset), but I'm equally unsure of using the query for each page load. I would have to get the billing tier id from the realtor: bt_id = user.realtor.billing_tier.id and then query the model like so:
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct()
I think the second option reads nicer, but I think the first would perform better because I wouldn't have to import and query the BillingTier model.
Is there a better option, or are these two the best I can hope for? Also, which would be more efficient for every page load?
As per the OP's invitation, here's an answer.
The core question is how to define an efficient permission check based on a highly relational data model.
The first variant involves building a Python list from evaluating a Django query set. The suspicion must certainly be that it imposes unnecessary computations on the Python interpreter. Although it's not clear whether that's tolerable if at the same time it allows for a less complex database query (a tradeoff which is hard to assess), the underlying DB query is not exactly simple.
The second approach involves fetching additional 1:1 data through relational lookups and then checking if there is any record fulfilling access criteria in a different, 1:n relation.
Let's have a look at them.
bt_id = user.realtor.billing_tier.id: This is required to get the hook for the following 1:n query. It is indeed highly inefficient in itself. It can be optimized in two ways.
As per Django: Access Foreign Keys Directly, it can be written as bt_id = user.realtor.billing_tier_id because the id is of course present in billing_tier and needs not be found via a relational operation.
Assuming that the page in itself would only load a user object, Django can be told to fetch and cache relational data along with that via select_related. So if the page does not only fetch the user object but the required billing_tier_id as well, we have saved one additional DB hit.
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct() can be optimized using Django's exists because that will redurce efforts both in the database and regarding data traffic between the database and Python.
Maybe even Django's prefetch_related can be used to combine the 1:1 and 1:n queries into a single query, but it's much more difficult to judge whether that pays. Could be worth a try.
In any case, it's worth installing a gem called Django Debug Toolbar which will allow you to analyze how much time your implementation spends on database queries.

Filter on a list of tags

I'm trying to select all the songs in my Django database whose tag is any of those in a given list. There is a Song model, a Tag model, and a SongTag model (for the many to many relationship).
This is my attempt:
taglist = ["cool", "great"]
tags = Tag.objects.filter(name__in=taglist).values_list('id', flat=True)
song_tags = SongTag.objects.filter(tag__in=list(tags))
At this point I'm getting an error:
DatabaseError: MultiQuery does not support keys_only.
What am I getting wrong? If you can suggest a completely different approach to the problem, it would be more than welcome too!
EDIT: I should have mentioned I'm using Django on Google AppEngine with django-nonrel
You shouldn't use m2m relationship with AppEngine. NoSQL databases (and BigTable is one of them) generally don't support JOINs, and programmer is supposed to denormalize the data structure. This is a deliberate design desicion: while your database will contain redundant data, your read queries will be much simpler (no need to combine data from 3 tables), which in turn makes the design of DB server much simpler as well (of course this is made for the sake of optimization and scaling)
In your case you should probably get rid of Tag and SongTag models, and just store the tag in the Song model as a string. I of course assume that Tag model only contains id and name, if Tag in fact contains more data, you should still have Tag model. Song model in that case should contain both tag_id and tag_name. The idea, as I explained above, is to introduce redundancy for the sake of simpler queries
Please, please let the ORM build the query for you:
song_tags = SongTag.objects.filter(tag__name__in = taglist)
You should try to use only one query, so that Django also generates only one query using a join.
Something like this should work:
Song.objects.filter(tags__name__in=taglist)
You may need to change some names from this example (most likely the tags in tags__name__in), see https://docs.djangoproject.com/en/1.3/ref/models/relations/.

Seeking Good Approach to Persist Data Submitted Through Dynamic Django Forms

Summary:
Looking for a good way to save data to Django models for which the associated forms are generated dynamically.
Detail:
I've been puzzling over the best approach for creating dynamic Django forms backed by models. For example, I'd like to create an interface where a user can create an HTML form, customize the types of fields in that form dynamically (Number, String, Dropdown Box, Date, etc.), and then display that form to other users so those users can submit data which is saved to a database. I'm not sure how to make an efficient approach to persist the data.
www.formsite.com and www.mailchimp.com have some form-building tools that are nice examples of what I am trying to do. Jacob Kaplan-Moss has an excellent tutorial on how to create the forms dynamically, but the tutorial doesn't get into how to persist the data.
As a dummy example, one (perhaps bad?) approach might be to create some models like below, where there is a database table for the SurveyQuestions (storing the customizable names and datatypes of each field) and one for the SurveyQuestionResponses (each record storing an individual response for a SurveyQuestion on a particular Survey).
However, it seems like this approach might result in really complex and slow queries. For example, if a Survey has 10 questions and you would like to display 10 user responses to that survey, there would be queries to select all 10 SurveyQuestions and then for each survey responder, there would be a query to select each of the SurveyQuestionResponses. It seems like the number of queries needed could add up really fast!
class Survey(models.Model):
# some fields here.
pass
class SurveyQuestion(models.Model):
""" Defines the headings and field
types for a given Survey.
"""
survey = models.ForeignKey(Survey)
field_name = models.CharField(
max_length=255,
help_text='Enter the name for this field heading')
field_type = models.IntegerField(
choices=choices.FIELD_TYPES,
help_text='Enter the data type for this field')
display_order = models.IntegerField(default=0)
class SurveyQuestionResponse(models.Model):
survey_field = models.ForeignKey(SurveyQuestion)
response value = models.TextArea(blank=True, null=True)
Is there a better approach to persisting data based on dynamic forms? Should I be somehow converting a form respondent's response to some sort of pickled format and store it to a TextField (Instead of having 10 SurveyQuestionResponse records there would be one record with all the response values pickled together)? I'm not too familiar with NoSQL options, but would a NoSQL approach work best for this type of thing? Is there some sort of rendering or caching that would make sense to do?
I keep encountering situations where saving data from dynamic forms like this would be very useful. I am wondering what other people's approaches are. Any advice is much appreciated. Thanks for reading this admittedly long question.
Joe
For a relational database an Entity-attribute-value model(EAV) could be used to achieve a dynamic, or open schema. Relational databases are not really suited for this type of schema, and this generally results in very slow queries over time. NoSQL has its own set of issues but I think that it would be best suited to your requirements. If you decide to take this route you can take a look at MongoDB. I have not used it extensively, but it seems most similar to relational database than the other NoSQL database out there, and its python interface seems pretty similar to django's ORM. By the was I remember finding a nice EAV example for Django. Though I don't remember where at the moment.