How to improve query speed over 2 million records in Django RESTful APIs

I have a dataset of 2 million scientific research publication records. I used Django REST framework to write APIs for searching the data by title and abstract. A search takes 12 seconds with Postgres as the DB, but drops to 6 seconds if I use MongoDB instead.
Even 6 seconds feels like a long wait for a user. I indexed the title and abstract, but indexing the abstract failed because some of the abstract texts are too long.
Here is the Django model using MongoDB (MongoEngine as the ODM):

class Journal(Document):
    title = StringField()
    journal_title = StringField()
    abstract = StringField()
    full_text = StringField()
    pub_year = IntField()
    pub_date = DateTimeField()
    pmid = IntField()
    link = StringField()
How do I improve the query performance? What stack makes search and retrieval faster?

Some pointers about optimisation for the Django ORM with Postgres:
Use db_index=True on fields that will be searched often and have some degree of repetition between entries, like "title".
Use values() and values_list() to select only the columns you want from a QuerySet.
If you're doing full-text search on any of those columns (like a contains query), bear in mind that Django has built-in support for full-text search directly on a Postgres database (see the sketch after this list).
Use print(queryset.query) to check what SQL is being sent to your database and whether it can be improved upon.
Many Postgres optimisation techniques rely on custom SQL queries, which can be written in Django using RawSQL expressions.
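As an illustration of the full-text point above, here is a minimal sketch using Django's built-in Postgres search, assuming the journal data lived in a regular Django/Postgres model with title and abstract columns (the model name and search string are just examples):

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# Minimal sketch: rank matches on title higher than matches on abstract.
vector = SearchVector('title', weight='A') + SearchVector('abstract', weight='B')
query = SearchQuery('machine learning')

results = (
    Journal.objects
    .annotate(rank=SearchRank(vector, query))
    .filter(rank__gte=0.1)
    .order_by('-rank')
    .values('title', 'pub_year')  # fetch only the columns you need
)

At 2 million rows you would normally also store a precomputed SearchVectorField with a GIN index, so the vectors are not rebuilt on every query.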
Remember that there are many, many ways to search for data in a database, be it relational or non-relational in nature. In your case, MongoDB is not inherently "faster" than Postgres; it's just doing a better job of querying what you really want.


Django: Get the ManyToMany object of ManyToManyField

I know I can fetch all authors of a paper like:

paper.authors.all()

This works fine, but it just returns a QuerySet of Authors. I want the ManyToMany rows themselves, i.e.

(id (BigAutoField), paper, author)

because I want to sort by the through table's IDs. Is there a faster way to do it than:

Paper.authors.through.objects.all().filter(paper=paper)

My database is really large (~200 million entries), so the command above is not feasible.
My model looks like:

class Paper(models.Model, ILiterature):
    authors = models.ManyToManyField(Author, blank=True)
    (...)
You can try to fetch in bulk:

papers = Paper.authors.through.objects.in_bulk(ids)

Django's bulk commands are faster and designed for massive databases like yours. See https://levelup.gitconnected.com/optimizing-django-queries-28e96ad204de for details.
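If the goal is specifically the through rows for one paper sorted by their IDs, a minimal sketch (assuming the Paper/Author models above) is to query the through model directly and select only the columns needed:

# Hypothetical sketch: fetch (id, paper_id, author_id) for one paper,
# ordered by the through table's auto ID, without building model instances.
through = Paper.authors.through
rows = (
    through.objects
    .filter(paper_id=paper.id)  # filter on the FK column directly
    .order_by('id')             # the through table's BigAutoField
    .values_list('id', 'paper_id', 'author_id')
)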

How to structure django models and query them with ease?

I am building a Django web app that will count the total number of people entering and exiting a school library in a day, week and year, and then save the counts to the DB.
The web app uses a camera controlled by OpenCV to show a live feed on the frontend (I have already implemented this successfully).
My problem is:
How can I design and structure my models to store the data by day, week, month and year?
And how can I query them to display the data on different bar charts using Chart.js?
I haven't used Chart.js before, but I think I can answer the first part of your question.
Consider this model from one of my projects for a "post" that a user can make on my webapp.
class Post(models.Model):
    slug = models.SlugField(unique=True)
    title = models.CharField(max_length=100)
    description = models.CharField(max_length=2200)
    image = models.ImageField(upload_to=photo_path, blank=False, null=True)
    timestamp = models.DateTimeField(auto_now_add=True)
Using a "DateTimeField" (or alternatively a "DateField") you can pretty easily store timestamp information which can be filtered using standard python Date or DateTime object comparisons. In my example, I'm storing image files and text information.
For your case you could simply create a new "Person" model where each person is given a timestamp (and whatever other info you might want to store) and then using django querying you can count how many people match certain datetime parameters.
Note that the Django docs (https://docs.djangoproject.com/en/3.1/ref/models/querysets/) recommend:
Don't use len() on QuerySets if all you want to do is determine the number of records in the set. It's much more efficient to handle a count at the database level, using SQL's SELECT COUNT(*), and Django provides a count() method for precisely this reason.
An example of how I'd approach your problem:

Models:

class Person(models.Model):
    timestamp = models.DateTimeField(auto_now_add=True)
    # whatever extra data you want on each person walking by

    @staticmethod
    def get_number_of_people(start_timestamp, end_timestamp):
        return Person.objects.filter(timestamp__gte=start_timestamp,
                                     timestamp__lt=end_timestamp).count()
(Note the "__gte" and "__lt" are built-in for Django querying and imply [start_timestamp, end_timestamp) inclusive start time and exclusive endtime)
Now you should be able to store your data rather simply and quantify how many people objects were created in whatever timeframe you'd like!
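For instance, a hedged usage sketch (the window sizes are arbitrary; timezone.now() and timedelta are standard Django/Python):

from datetime import timedelta
from django.utils import timezone

# Hypothetical usage of the model above: counts over rolling windows.
now = timezone.now()
daily_count = Person.get_number_of_people(now - timedelta(days=1), now)
weekly_count = Person.get_number_of_people(now - timedelta(weeks=1), now)
yearly_count = Person.get_number_of_people(now - timedelta(days=365), now)

These counts can then be serialized and handed to Chart.js as the bar values.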

Should I use ArrayField or ManyToManyField for tags

I am trying to add tags to a model for a Postgres DB in Django, and I found two solutions:

Using foreign keys:

class Post(models.Model):
    tags = models.ManyToManyField('Tag')
    ...

class Tag(models.Model):
    name = models.CharField(max_length=140)

Using an array field:

from django.contrib.postgres.fields import ArrayField

class Post(models.Model):
    tags = ArrayField(models.CharField(max_length=140))
    ...

Assuming that I don't care about supporting other database backends in my code, what is the recommended solution?
If you use an ArrayField:
The size of each row in your DB is going to be a bit larger, so Postgres will end up using more TOAST storage.
Every time you fetch the row, unless you specifically defer the field or otherwise exclude it from the query via only(), values(), or similar, you pay the cost of loading all those values each time you iterate over that row. If that's what you need, so be it.
Filtering on values in that array, while possible, isn't going to be as nice, and the Django ORM doesn't make it as obvious as it does for M2M tables (see the sketch after these lists).
If you use a ManyToManyField:
You can filter more easily on those related values.
Those related objects are loaded lazily by default; you can use prefetch_related() if you need them, and get fancy if you want only a subset of those values loaded.
Total storage in the DB is going to be slightly higher with M2M because of the keys and the extra id fields.
The cost of the joins in this case is completely negligible, because everything is keyed.
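A minimal sketch of the filtering difference, assuming the Post/Tag models from the question (the tag value is illustrative):

# ArrayField version: membership test via the __contains lookup.
Post.objects.filter(tags__contains=["django"])

# M2M version: filter through the related model's field, and prefetch
# so iterating over posts doesn't issue one tag query per post.
Post.objects.filter(tags__name="django").prefetch_related("tags")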
That said, the answer above isn't originally mine. A while ago I stumbled upon this same dilemma while learning Django, and found the answer in this question: Django Postgres ArrayField vs One-to-Many relationship.
Hope you get what you were looking for.
If you want the Tag class to be monitored (for example: how many tags there are, how many of a particular tag, etc.), go for the first option, since you can add more fields to the model and that will add richness to the app.
On the other hand, if you just want an array of strings for display or minimal processing, go for the ArrayField option.
But if you wish to save time and still add richness to the app, you can use django-taggit:
https://github.com/alex/django-taggit
Initialising it is as simple as:

from django.db import models
from taggit.managers import TaggableManager

class Food(models.Model):
    # ... fields here
    tags = TaggableManager()
and it can be used in the following way:
>>> apple = Food.objects.create(name="apple")
>>> apple.tags.add("red", "green", "delicious")
>>> apple.tags.all()
[<Tag: red>, <Tag: green>, <Tag: delicious>]

Django-Python/MySQL: How can I access a field of a table in the database that is not present in a model's field?

This is what I wanted to do:
I have a table imported from another database. The majority of the columns of one of the tables look something like this: AP1|00:23:69:33:C1:4F, and there are a lot of them. I don't think Python will accept them as field names.
I wanted to aggregate them without having to list them as fields in the model. As much as possible I want the aggregation to be triggered from within the Django application, so I don't want to resort to creating MySQL queries outside the application.
Thanks.
Unless you want to write raw SQL, you're going to have to define a model. Since your model fields don't HAVE to be named the same thing as the column they represent, you can give your fields useful names.
class LegacyTable(models.Model):
    useful_name = models.IntegerField(db_column="AP1|00:23:69:33:C1:4F")

    class Meta:
        db_table = "LegacyDbTableThatHurtsMyHead"
        managed = False  # syncdb does nothing
You may as well do this regardless. As soon as you require the use of another column in your legacy database table, just add another_useful_name to your model, with db_column set to the column you're interested in.
This has two solid benefits. One, you no longer have to write raw SQL. Two, you do not have to define all the fields up front.
The alternative is to define all your fields in raw SQL anyway.
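Once the interesting columns are mapped, the aggregation you asked about is plain ORM work. A minimal sketch, assuming the LegacyTable model above (the choice of aggregates is just an example):

from django.db.models import Avg, Sum

# Hypothetical sketch: aggregate the mapped column through the ORM,
# with no raw MySQL outside the application.
totals = LegacyTable.objects.aggregate(
    total=Sum('useful_name'),
    average=Avg('useful_name'),
)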
Edit:
The Legacy databases guide in the Django docs describes a method for inspecting existing databases and generating a models.py file from the existing schema. It may help you by doing all the heavy lifting (nulls, lengths, types, fields); you can then modify the definitions to suit your needs:
python manage.py inspectdb > legacy.py
http://docs.djangoproject.com/en/dev/topics/db/sql/#executing-custom-sql-directly
Django allows you to perform raw SQL queries. Without more information about your tables, that's about all I can offer.
Custom query:

def my_custom_sql(self):
    from django.db import connection, transaction
    cursor = connection.cursor()
    # Data modifying operation - commit required
    cursor.execute("UPDATE bar SET foo = 1 WHERE baz = %s", [self.baz])
    transaction.commit_unless_managed()
    # Data retrieval operation - no commit required
    cursor.execute("SELECT foo FROM bar WHERE baz = %s", [self.baz])
    row = cursor.fetchone()
    return row
Accessing other databases:

from django.db import connections
cursor = connections['my_db_alias'].cursor()
# Your code here...
transaction.commit_unless_managed(using='my_db_alias')

Django ManyToMany in one query

I'm trying to optimise my app by keeping the number of queries to a minimum... I've noticed I'm getting a lot of extra queries when doing something like this:
class Category(models.Model):
    id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=127, blank=False)

class Project(models.Model):
    categories = models.ManyToManyField(Category)
Then later, if I want to retrieve a project and all related categories, I have to do something like this:

{% for category in project.categories.all %}

Whilst this does what I want, it does so in two queries. I was wondering if there is a way of joining the M2M field so I could get the results I need with just one query. I tried this:

def category_list(self):
    return self.join(list(self.category))

But it's not working.
Thanks!
"Whilst this does what I want, it does so in two queries." What do you mean by this? Do you want to pick up a Project and its categories using one query?
If so, then unfortunately there is no mechanism at present to do this without resorting to a custom SQL query. The select_related() mechanism used for foreign keys won't work here either. There was a Django ticket open for this, but it has been closed as "wontfix" by the Django developers.
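As a hedged aside: on later Django versions, prefetch_related() is the usual practical answer. It still issues a second query rather than a single JOIN, but it avoids one categories query per project:

# Sketch, assuming the Project/Category models above: two queries total
# for any number of projects.
projects = Project.objects.prefetch_related('categories')
for project in projects:
    for category in project.categories.all():  # served from the prefetch cache
        print(category.name)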
What you want does not seem to be possible, because:
At the DBMS level, a many-to-many relation cannot be stored in the two tables alone, so an intermediate table is needed to join tables in a ManyToMany relation.
At the Django level, for your model definition, Django creates an extra table for the ManyToMany connection. The table is named after your two tables; in this example it will be something like [app_name]_project_categories, and it contains foreign keys to your two database tables.
So retrieving a Project together with its Category objects always goes through that join table; you can filter across the relation (e.g. categories__name) in your model's filter or get functions, but loading the related Category rows themselves still requires a second query.
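For completeness, a hedged sketch of the custom-SQL route the first answer mentions. The table names are assumptions based on Django's default naming for an app called "myapp" (myapp_project, myapp_project_categories, myapp_category):

from django.db import connection

# Hypothetical raw-SQL sketch: one JOIN pulling each project's id
# together with its category names.
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT p.id, c.name
        FROM myapp_project p
        JOIN myapp_project_categories pc ON pc.project_id = p.id
        JOIN myapp_category c ON c.id = pc.category_id
        ORDER BY p.id
    """)
    rows = cursor.fetchall()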