Django model for sparse data - django

I am developing a django app that contains a number of forms which will be used to enter clinical data on some cancer tissue samples (10-20 fields per form, mostly CharField, FloatField and some multiple choice text dropdowns).
My challenge is that I need a form that can display different fields based on a diagnosis, for 150+ diagnoses. I can programmatically read the list of diagnoses, the fields required for each diagnosis and corresponding field types. Also, the set of all unique fields across all diagnoses is large (much larger than the number of fields needed for any specific diagnosis).
e.g.
diagnosis | disease_specific_fields | field_type
B-lymphoblastic leukemia/lymphoma NOS | EBV-positive | Pull down: Yes/No
B-lymphoblastic leukemia/lymphoma with recurrent genetic abnormalities(TCF3-PBX1) | EBV-positive | Pull down: Yes/No
Monoclonal B lymphocytosis(CLL/SLL spectrum) | EBV-positive | Pull down: Yes/No
Peripheral T cell lymphoma NOS | EBV-positive | Pull down: Yes/No
AML with recurrent cytogenetic abnormalities(t(6;9) DEK-NUP214) | EBV-positive | Pull down: Yes/No
So far, I thought of the following approaches:
1. Create a single huge model that will contain mostly sparse data, and handle irrelevant data using django forms. CONS: inefficient storage and a lot of overhead code tied to forms.
2. Create a model for each diagnosis. CONS: complicates migrations and maintenance, I think.
3. Create one small model for all diagnoses that contains several 'generic' fields of each type ('CharField', 'FloatField', etc), and render respective field names dynamically in forms / views.
I am looking for any constructive suggestions on how to implement a model/models capturing the above data. Efficiency and storage are secondary concerns, mostly I want a clean and intuitive solution. Any answers tailored for django will be especially helpful.

A few options I'd consider:
Use Django-Polymorphic to create inheritance-based model types
Django-Polymorphic allows you to use inheritance for differentiating between types of models.
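from django.db import models
from polymorphic.models import PolymorphicModel

class Animal(PolymorphicModel):
    kingdom = models.CharField(max_length=50, default="Animalia")

class Lizard(Animal):
    # "class" is a reserved word in Python, so the field needs another name
    taxonomic_class = models.CharField(max_length=50, default="Reptilia")

class Iguana(Lizard):
    favorite_tree = models.CharField(max_length=50)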
Django-Polymorphic builds on Django's multi-table inheritance: each model in the hierarchy gets its own table, linked to its parent by a one-to-one key, and the concrete type of every row is stored with it. So if you know the specific fields you want to capture, hard-code them as subclasses. You can also filter by level (in the example above, you could query all Animal instances or only the Iguana instances). Fetching a mixed result costs roughly one extra query per subclass type present in it, so performance is usually good as long as the hierarchy stays shallow.
Use Django-Mutant if dynamic field creation is needed
Django-Mutant allows for dynamic creation of fields per model, allowing you to define data as needed on the fly. However, intermediary tables are required to do this. You gain a lot of flexibility while losing performance.
Use the postgres-specific JsonField to store data
Django 1.9 introduced native support for the JSONField field type, allowing you to write JSON structures to a db field as well as query them relatively quickly. You get amazing flexibility with decent performance, but you may struggle to provide user-friendly forms to create, update, and verify the data. However, it has been done in many projects and there are libraries out there to assist with it.
from django.contrib.postgres.fields import JSONField
from django.db import models

class SomeModel(models.Model):
    attributes = JSONField()

>>> some_attributes = {'color': 'red', 'cell_count': 150, 'enzymes': ['xyzyss', 'xyxzxxyx']}
>>> a = SomeModel.objects.create(attributes=some_attributes)
>>> SomeModel.objects.filter(attributes__color='red')
(returns a queryset containing instance 'a')
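One way to keep JSON attributes verifiable is to check them against the per-diagnosis field list before saving. The sketch below is plain Python with no Django dependency. `FIELD_SPECS`, the `"choice"` and `"float"` type labels, and `validate_attributes` are hypothetical names chosen for illustration, and the spec would in practice be built from wherever the list of diagnoses is read.

```python
# Hypothetical per-diagnosis spec: field name -> (type label, allowed choices or None).
FIELD_SPECS = {
    "Peripheral T cell lymphoma NOS": {
        "EBV-positive": ("choice", ["Yes", "No"]),
        "cell_count": ("float", None),
    },
}


def validate_attributes(diagnosis, attributes):
    """Return a list of error strings; an empty list means the data is valid."""
    spec = FIELD_SPECS.get(diagnosis)
    if spec is None:
        return [f"unknown diagnosis: {diagnosis}"]

    errors = []
    # Reject keys that this diagnosis does not define.
    for name in attributes:
        if name not in spec:
            errors.append(f"unexpected field: {name}")

    # Check each expected field for presence and type.
    for name, (kind, choices) in spec.items():
        if name not in attributes:
            errors.append(f"missing field: {name}")
            continue
        value = attributes[name]
        if kind == "choice" and value not in choices:
            errors.append(f"{name}: {value!r} is not one of {choices}")
        elif kind == "float" and (
            isinstance(value, bool) or not isinstance(value, (int, float))
        ):
            errors.append(f"{name}: expected a number")
    return errors
```

A function like this could be called from a form's or model's `clean()` method, so that the same spec drives both the fields shown and the data accepted.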

Related

How to censor specific fields based on condition using django QuerySet API

Using Django we have situation where we have a model Case which can be set to being a medical case or not (through a BooleanField).
Now, we also have a system to check if a certain employee (User subclass) is authorized to see sensitive data when a case is labeled as being medical (containing medical sensitive data).
I am able to annotate a new field to each instance, a BooleanField letting us know if the requesting employee is authorized to see medical data on the specific Case instance or not.
Ideally, I would like to have the database censor out specific fields (field customer for example), when the requesting employee is not authorized to see medical data for that case. I imagine this can be done with an annotate method, and a combination of from django.db.models.Case and from django.db.models.When.
But, what we would also like is that the resulting QuerySet keeps the same field names on the different model instances. We don't want to change the field name of customer to another name.
We have actually come up with a solution, using .values first, and then the .annotate for each field we want to potentially censor out (see code below). This isn't ideal though, for multiple reasons. For one, we don't get back model instances, but dictionaries. Also, but this is another question, one of the fields that needs to be censored is a ManyToManyField, and using .values now returns a unique row for each instance referred to through the ManyToManyField (any solution for that?)
Also, ideally, this queryset would be the base queryset for all situations in which an employee tries to request Cases in our app. We want all our colleagues to use this base queryset so that we don't have to implement the same solution in multiple places, and prevent sensitive data from leaking.
So, I am wondering, can anyone recommend a solution for this situation?
Thanks in advance!
PS. We would like to have this done by the database since the amount of cases being fetched is potentially very high, and doing this in Python would probably require a lot of CPU power and thus kill performance.
from django.db.models import Case, When, BooleanField, IntegerField, F, Value, Q

OurModel.objects.annotate(
    employee_medical_authorized=Case(
        When(..., then=Value(True)),
        default=Value(False),
        output_field=BooleanField()
    )
).values(...).annotate(
    customer=Case(
        When(Q(employee_medical_authorized=Value(False)) & Q(medical=Value(True)),
             then=Value(None)),
        default=F('customer'),
        output_field=IntegerField()
    )
)
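At the SQL level, the censoring described above comes down to a CASE expression aliased back to the original column name. The following is only a sketch of that idea using Python's built-in sqlite3 module, not the Django ORM. The `cases` table, its columns, and the `authorized_case_ids` rule are all hypothetical stand-ins for the real model and authorization check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (id INTEGER PRIMARY KEY, medical INTEGER, customer INTEGER)"
)
conn.executemany(
    "INSERT INTO cases (id, medical, customer) VALUES (?, ?, ?)",
    [(1, 1, 101), (2, 0, 102), (3, 1, 103)],
)

# Hypothetical rule: ids of medical cases the requesting employee may see.
authorized_case_ids = (3,)

rows = conn.execute(
    """
    SELECT id,
           CASE WHEN medical = 1 AND id NOT IN (?) THEN NULL
                ELSE customer
           END AS customer  -- same column name as the underlying field
    FROM cases
    ORDER BY id
    """,
    authorized_case_ids,
).fetchall()
```

Here the unauthorized medical case comes back with `customer` set to NULL, while the non-medical case and the authorized medical case keep their values, and the work is done by the database rather than in Python.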

Using a Textfield with JSON instead of a ForeignKey relationship?

I am working on a project where users can roll dice pools. A dice pool is, for example, a throw with 3 red dice, 2 blue and 1 green. A pool is composed of several dice rolls and modifiers.
I have 3 models connected this way:
class DicePool(models.Model):
    # some relevant fields
    pass

class DiceRoll(models.Model):
    pool = models.ForeignKey(DicePool, on_delete=models.CASCADE)
    # plus a few more information fields with the type of die used, result, etc.

class Modifier(models.Model):
    pool = models.ForeignKey(DicePool, on_delete=models.CASCADE)
    # plus about 4 more information fields
Now, when I load the DicePool history, I need to prefetch both the DiceRoll and Modifier.
I am now considering replacing the Modifier model with a text field in DicePool containing some JSON, just to reduce the number of database queries.
Is it common to use a JSON text field instead of a database relationship?
Or am I thinking about this wrong, and is it completely normal to do additional queries with prefetch_related every time I load my pools?
I personally find using a ForeignKey cleaner and it would let me do db-wise changes to data if needed. But my code is making too many db queries and I am trying to see where I can improve it.
FYI: I am using MySQL
Is it common to use a JSON text field instead of a database relationship?
I don't think so. Also, I don't believe it's advisable because (especially with MySQL, which doesn't support things like JSONField) you'll end up with text that you then need to parse into a dict before you can look up the things you want.
Personally, I (and I would assume most people) would stick to FK relationships. Also, by using prefetch_related or select_related you're already avoiding unnecessary queries.
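To see why prefetching keeps the query count small, note that prefetch_related does one extra lookup per relationship and joins the results in Python. The sketch below imitates that with the built-in sqlite3 module rather than Django. The table layout, the `result` and `amount` columns, and `fetch_children` are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dicepool (id INTEGER PRIMARY KEY);
    CREATE TABLE diceroll (id INTEGER PRIMARY KEY, pool_id INTEGER, result INTEGER);
    CREATE TABLE modifier (id INTEGER PRIMARY KEY, pool_id INTEGER, amount INTEGER);
    INSERT INTO dicepool (id) VALUES (1), (2);
    INSERT INTO diceroll (pool_id, result) VALUES (1, 4), (1, 6), (2, 3);
    INSERT INTO modifier (pool_id, amount) VALUES (1, 2), (2, -1);
    """
)


def fetch_children(table, column, pool_ids):
    """One query per relation: fetch all child rows for the given pools."""
    placeholders = ", ".join("?" for _ in pool_ids)
    query = (
        f"SELECT pool_id, {column} FROM {table} "
        f"WHERE pool_id IN ({placeholders}) ORDER BY id"
    )
    grouped = defaultdict(list)
    for pool_id, value in conn.execute(query, pool_ids):
        grouped[pool_id].append(value)
    return grouped


# Query 1: the pools themselves.
pool_ids = [row[0] for row in conn.execute("SELECT id FROM dicepool ORDER BY id")]
# Queries 2 and 3: one per related table, however many pools there are.
rolls = fetch_children("diceroll", "result", pool_ids)
modifiers = fetch_children("modifier", "amount", pool_ids)

# Stitch the children onto their pools in Python.
history = {
    pid: {"rolls": rolls[pid], "modifiers": modifiers[pid]} for pid in pool_ids
}
```

The whole history is loaded with three queries in total, and that number does not grow with the number of pools, which is the main reason the extra prefetch queries are normally acceptable.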

Django, is filtering by string faster than SQL relationships?

Is it a major flaw if I'm querying my user's information by their user_id (string) rather than creating a Profile model and linking them to other models using SQL relationships?
Example 1: (user_id is stored in django sessions.)
class Information(models.Model):
    user_id = models.CharField(...)
    ...

# also applies for .filter() operations.
information = Information.objects.get(user_id=request.getUser['user_id'])
Note: I am storing the users' profile information on Auth0.
Example 2: (user_id is stored in Profile.)
class Profile(models.Model):
    user_id = models.CharField(...)

class Information(models.Model):
    profile = models.ForeignKey(Profile, ...)
    ...

information = Information.objects.get(profile=request.getProfile)
note: With this method Profile will only have one field, user_id.
In Django, will using a string instead of a model instance affect performance when retrieving items?
Performance is not an issue here as noted by Dirk; as soon as a column is indexed, the performance difference between data types should be negligible when compared to other factors. Here's a related SO question for more perspective.
What you should take care of is to prevent the duplication of data whose integrity you then would have to take care of on your own instead of relying on well-tested integrity checks in the database.
Another aspect is that if you do have relations between your data, you absolutely should make sure that they are accurately represented in your models using Django's relationships. Otherwise there's really not much point in using Django's ORM at all. Good luck!
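As a small illustration of the integrity checks mentioned above, the sketch below uses the built-in sqlite3 module (not Django) to show a database refusing a row that points at a non-existent profile. The table names mirror Example 2, while `insert_information` and the sample `user_id` value are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE profile (id INTEGER PRIMARY KEY, user_id TEXT UNIQUE NOT NULL);
    CREATE TABLE information (
        id INTEGER PRIMARY KEY,
        profile_id INTEGER NOT NULL REFERENCES profile(id)
    );
    INSERT INTO profile (id, user_id) VALUES (1, 'auth0|abc');
    """
)


def insert_information(profile_id):
    """Try to insert a row; report whether the database accepted it."""
    try:
        conn.execute(
            "INSERT INTO information (profile_id) VALUES (?)", (profile_id,)
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid = insert_information(1)     # profile 1 exists
orphan = insert_information(999)  # no such profile, rejected by the database
row_count = conn.execute("SELECT COUNT(*) FROM information").fetchone()[0]
```

With a bare `user_id` string column there is no such constraint, so an orphaned or mistyped id would be stored without complaint and you would have to catch it yourself.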

Should I use JSONField over ForeignKey to store data?

I'm facing a dilemma: I'm creating a new product and I don't want to mess up the way I organise the information in my database.
I have two choices for my models. The first would be to use foreign keys to link them together:
class Page(models.Model):
    data = JSONField()

class Image(models.Model):
    page = models.ForeignKey(Page, on_delete=models.CASCADE)
    data = JSONField()

class Video(models.Model):
    page = models.ForeignKey(Page, on_delete=models.CASCADE)
    data = JSONField()

etc...
The second is to keep everything in Page's JSONField:
class Page(models.Model):
    data = JSONField()  # videos, pictures, etc. are stored here
Is one better than the other, and why? This would be a huge help for how I organize my databases in the future.
I thought maybe the second option could be slower, since every time something changes the whole JSON would be overwritten, but does it make a huge difference or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
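A minimal sketch of that failure mode, using plain dicts in place of the stored JSON. The `caption` key and both helper functions are hypothetical names used only for illustration.

```python
# Two stored JSON payloads: the older one predates the "caption" key.
old_page_data = {"title": "Home"}
new_page_data = {"title": "About", "caption": "Our team"}


def caption_strict(data):
    # Raises KeyError for objects saved before the key existed.
    return data["caption"]


def caption_safe(data, default=""):
    # Tolerates older objects by falling back to a default.
    return data.get("caption", default)
```

The defensive version works, but every reader of the JSON has to remember to write it that way, whereas a real model field with a default gives the same guarantee once, in the schema.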
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.

Seeking Good Approach to Persist Data Submitted Through Dynamic Django Forms

Summary:
Looking for a good way to save data to Django models for which the associated forms are generated dynamically.
Detail:
I've been puzzling over the best approach for creating dynamic Django forms backed by models. For example, I'd like to create an interface where a user can create an HTML form, customize the types of fields in that form dynamically (Number, String, Dropdown Box, Date, etc.), and then display that form to other users so those users can submit data which is saved to a database. I'm not sure how to make an efficient approach to persist the data.
www.formsite.com and www.mailchimp.com have some form-building tools that are nice examples of what I am trying to do. Jacob Kaplan-Moss has an excellent tutorial on how to create the forms dynamically, but the tutorial doesn't get into how to persist the data.
As a dummy example, one (perhaps bad?) approach might be to create some models like below, where there is a database table for the SurveyQuestions (storing the customizable names and datatypes of each field) and one for the SurveyQuestionResponses (each record storing an individual response for a SurveyQuestion on a particular Survey).
However, it seems like this approach might result in really complex and slow queries. For example, if a Survey has 10 questions and you would like to display 10 user responses to that survey, there would be queries to select all 10 SurveyQuestions and then for each survey responder, there would be a query to select each of the SurveyQuestionResponses. It seems like the number of queries needed could add up really fast!
class Survey(models.Model):
    # some fields here.
    pass

class SurveyQuestion(models.Model):
    """ Defines the headings and field
    types for a given Survey.
    """
    survey = models.ForeignKey(Survey, on_delete=models.CASCADE)
    field_name = models.CharField(
        max_length=255,
        help_text='Enter the name for this field heading')
    field_type = models.IntegerField(
        choices=choices.FIELD_TYPES,
        help_text='Enter the data type for this field')
    display_order = models.IntegerField(default=0)

class SurveyQuestionResponse(models.Model):
    survey_field = models.ForeignKey(SurveyQuestion, on_delete=models.CASCADE)
    # TextField is the model field; "TextArea" is only a form widget
    response_value = models.TextField(blank=True, null=True)
Is there a better approach to persisting data based on dynamic forms? Should I be somehow converting a form respondent's response to some sort of pickled format and store it to a TextField (Instead of having 10 SurveyQuestionResponse records there would be one record with all the response values pickled together)? I'm not too familiar with NoSQL options, but would a NoSQL approach work best for this type of thing? Is there some sort of rendering or caching that would make sense to do?
I keep encountering situations where saving data from dynamic forms like this would be very useful. I am wondering what other people's approaches are. Any advice is much appreciated. Thanks for reading this admittedly long question.
Joe
For a relational database, an entity-attribute-value (EAV) model could be used to achieve a dynamic, or open, schema. Relational databases are not really suited to this type of schema, and it generally results in very slow queries over time. NoSQL has its own set of issues, but I think it would be best suited to your requirements. If you decide to take this route you can take a look at MongoDB. I have not used it extensively, but it seems to be the NoSQL database most similar to a relational database, and its Python interface seems pretty similar to Django's ORM. By the way, I remember finding a nice EAV example for Django, though I don't remember where at the moment.
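For reference, an EAV layout does not have to mean one query per response: all values for a survey can be fetched with a single JOIN and pivoted in Python. The sketch below uses the built-in sqlite3 module rather than Django. The table names follow the models in the question, but the `respondent` column is an assumption added here, since the original models do not say how responses are grouped by responder.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE survey_question (
        id INTEGER PRIMARY KEY, survey_id INTEGER,
        field_name TEXT, display_order INTEGER
    );
    CREATE TABLE survey_question_response (
        id INTEGER PRIMARY KEY, question_id INTEGER,
        respondent TEXT, response_value TEXT
    );
    INSERT INTO survey_question VALUES (1, 1, 'name', 0), (2, 1, 'age', 1);
    INSERT INTO survey_question_response (question_id, respondent, response_value)
    VALUES (1, 'r1', 'Ann'), (2, 'r1', '34'), (1, 'r2', 'Bob'), (2, 'r2', '41');
    """
)

# A single JOIN returns every (respondent, field, value) triple for the survey.
rows = conn.execute(
    """
    SELECT r.respondent, q.field_name, r.response_value
    FROM survey_question_response AS r
    JOIN survey_question AS q ON q.id = r.question_id
    WHERE q.survey_id = ?
    ORDER BY r.respondent, q.display_order
    """,
    (1,),
).fetchall()

# Pivot the attribute/value rows into one dict per respondent.
pivoted = defaultdict(dict)
for respondent, field_name, value in rows:
    pivoted[respondent][field_name] = value
responses = dict(pivoted)
```

This keeps the number of queries constant for displaying a survey's responses. The slowness usually attributed to EAV shows up more when filtering or aggregating on the values themselves, because every value is stored as text in one shared column.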