Django ElasticSearch DSL DRF aggregations - django

What is the correct approach to adding aggregations to Django ElasticSearch DSL DRF? The codebase contains some empty filter backends
https://github.com/barseghyanartur/django-elasticsearch-dsl-drf/tree/a2be5842e36a102ad66988f5c74dec984c31c89b/src/django_elasticsearch_dsl_drf/filter_backends/aggregations
Should I create a custom backend or is there a way to add them directly to my viewset?
In particular, I want to calculate the sum of an IntegerField across all results in a particular facet.

Elasticsearch has more than one type of aggregation. Simple aggregations in django-elasticsearch-dsl-drf are implemented in the FacetedSearchFilterBackend. Read the docs, or simply run the example project to experiment.
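For a metric like the sum of an IntegerField per facet bucket, one option is a small custom filter backend that attaches the aggregation to the search before it runs. A rough sketch, not part of the package itself, with made-up field and bucket names:

from rest_framework.filters import BaseFilterBackend

class SumPerFacetBackend(BaseFilterBackend):
    """Hypothetical backend: sum an integer field inside each facet bucket."""

    def filter_queryset(self, request, queryset, view):
        # In django-elasticsearch-dsl-drf, the "queryset" handed to filter
        # backends is an elasticsearch_dsl Search object, so aggregations can
        # be attached to it before the query is executed.
        queryset.aggs.bucket(
            "per_category", "terms", field="category.raw"
        ).metric(
            "total_amount", "sum", field="amount"
        )
        return queryset

The backend would then be listed in the viewset's filter_backends next to FacetedSearchFilterBackend. How the totals are surfaced in the DRF response depends on the pagination class; they are always present under the aggregations key of the underlying Elasticsearch response.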

Related

Handle large amounts of time series data in Django while preserving Django's ORM

We are using Django with its ORM in connection with an underlying PostgreSQL database and want to extend the data model and technology stack to store massive amounts of time series data (~5 million entries per day onwards).
The closest questions I found were this and this, which propose combining Django with databases such as TimescaleDB or InfluxDB. But this creates parallel structures to Django's built-in ORM and thus does not seem to be straightforward.
How can we handle large amounts of time series data while preserving or staying really close to Django's ORM?
Any hints on proven technology stacks and implementation patterns are welcome!
Your best option is to keep your relational data in Postgres and your time series data in a separate database, and combine them when needed in your code.
With InfluxDB you can do this join with a Flux script by passing it the SQL that Django's ORM would execute, along with your database connection info. Note that this returns the data in InfluxDB's format, not as Django models.
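A rough sketch of that idea, with made-up names (a Sensor model, a "metrics" bucket, a shared sensor_id column) and a hand-written query standing in for the SQL the ORM would generate:

from influxdb_client import InfluxDBClient

# The SQL string below is roughly what str(Sensor.objects.filter(active=True).query)
# would produce; Flux's "sql" package runs it against Postgres and the result is
# joined with the series data. Tags are strings, hence the ::text cast.
flux = '''
import "sql"

meta = sql.from(
    driverName: "postgres",
    dataSourceName: "postgresql://user:password@localhost/app",
    query: "SELECT id::text AS sensor_id, name FROM app_sensor WHERE active",
)

series = from(bucket: "metrics")
    |> range(start: -1d)
    |> filter(fn: (r) => r._measurement == "readings")

join(tables: {meta: meta, series: series}, on: ["sensor_id"])
'''

client = InfluxDBClient(url="http://localhost:8086", token="...", org="my-org")
tables = client.query_api().query(flux)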
Why not use a TimescaleDB in parallel to your existing Postgres for the time series data, and use this Django integration for it: https://pypi.org/project/django-timescaledb/.
Using multiple databases in Django is possible, although I have not done it myself so far. Have a look here for a convenient way to do it (rerouting certain models to another database instead of the default Postgres one):
Using Multiple Databases with django
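A minimal sketch of that rerouting, assuming the time series models live in a hypothetical "metrics" app:

# settings.py: a second connection for the time series data (e.g. TimescaleDB)
DATABASES = {
    "default": {"ENGINE": "django.db.backends.postgresql", "NAME": "app"},
    "timeseries": {"ENGINE": "django.db.backends.postgresql", "NAME": "metrics"},
}
DATABASE_ROUTERS = ["myproject.routers.TimeSeriesRouter"]

# myproject/routers.py
class TimeSeriesRouter:
    """Send every model of the hypothetical 'metrics' app to 'timeseries'."""

    route_app_labels = {"metrics"}

    def db_for_read(self, model, **hints):
        if model._meta.app_label in self.route_app_labels:
            return "timeseries"
        return None

    def db_for_write(self, model, **hints):
        if model._meta.app_label in self.route_app_labels:
            return "timeseries"
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label in self.route_app_labels:
            return db == "timeseries"
        return db == "default"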

Need guidance with creating Django based dashboard

I'm a beginner at Django, and as a practice project I would like to create a webpage with a dashboard to track investments in a particular P2P platform. They do not have a nice dashboard (but provide an Excel file with all the data). As I see it, the main steps I need to take in this project are as follows:
1. Create a login so that users have an account where they can upload their Excel files.
2. Make it possible to import the Excel file into a database.
3. Manipulate/calculate the data so it can later be used in the dashboard.
4. Create the dashboard.
5. Host the webpage.
After some struggle I have implemented point no. 2, and will deal with 1 and 5 later. But number 3 is my biggest issue now.
I'm completely unsure what I need to do, and Google did not help. I need to calculate data before I can make a dashboard from it: union two of the tables, then join them with a third table, creating some additional calculated fields. Do I create a view in the database and somehow fetch this data into Django? Or do I need to create some rules so that a new table is created at the time of the import? I think having a table instead of a view would give better performance. Or maybe I'm doing it completely wrong and should take a completely different approach for this kind of task? Also, is SQLite a good database for the task (I'm using it because it was the default in Django)?
I assume for the visualization part I will need to use some JavaScript library, such as D3, which would then use the data from step 3.
For part 3 there are two ways: either do the calculations up front and save the result in your database, or do them on demand using Django model features like annotation and aggregation.
Option 1 requires adding a table for your calculations, which means defining another model in Django.
Option 2 requires doing the annotations in a view or in model managers and then using them in your views.
Django docs: Aggregation
Which is best depends on how big your data is, how complicated the calculations are, and how often you need them.
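As a rough sketch of option 2, with a made-up Investment model and field names:

from django.db.models import Count, F, Sum

# Computed on demand, nothing stored: one row per platform with totals and a
# derived "outstanding" figure built from per-row expressions.
summary = (
    Investment.objects
    .values("platform")
    .annotate(
        positions=Count("id"),
        total_invested=Sum("amount"),
        outstanding=Sum(F("amount") - F("repaid")),
    )
    .order_by("platform")
)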
As for the database: SQLite is a database for development use, not production, and surely not with a lot of data and a lot of calculations. The recommended database for Django is PostgreSQL, which is pretty good at handling millions or even billions of rows and doing heavy calculations.
And for visualization, you should handle it on the template side, which is basically HTML, CSS and JS.
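If the JS charting code fetches its data as JSON, a small view can expose that queryset (still using the made-up names from the sketch above):

from django.db.models import Sum
from django.http import JsonResponse

def dashboard_data(request):
    # JsonResponse uses DjangoJSONEncoder, so Decimal aggregates serialize fine.
    data = list(
        Investment.objects
        .values("platform")
        .annotate(total_invested=Sum("amount"))
    )
    return JsonResponse(data, safe=False)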

Django - fulltext search with PostgreSQL and Elasticsearch

I have a Django and Django REST Framework powered RESTful API (talking to a PostgreSQL DB backend) which supports filtering on a specific model.
Now I want to add a fulltext search functionality.
Is it possible to use Elasticsearch for fulltext search and then apply my existing API filters on top of these search results?
I would suggest you consider using PostgreSQL only to do what you asked for.
In my opinion it is the best solution because you will have the data and the search indexes directly inside PostgreSQL and you will not be forced to install and maintain additional software (such as Elasticsearch) and keep the data and indexes in sync.
This is the simplest code example you can have to perform a full-text search in Django with PostgreSQL:
Entry.objects.filter(body_text__search='Cheese')
For the basics of using full-text search in Django with PostgreSQL, see the official documentation: "Full text search"
If you want to deepen further you can read an article that I wrote on the subject:
"Full-Text Search in Django with PostgreSQL"
Your question is too broad to be answered with code, but it's definitely possible.
You can easily search your Elasticsearch index for documents matching your full-text criteria.
Then get those documents' PK fields (or any other candidate key used to uniquely identify rows in your PostgreSQL DB), and filter your Django ORM-backed models for PKs matching those you found in Elasticsearch.
Pseudocode would be:
from elasticsearch import Elasticsearch

def get_chunk(values, length):
    # Yield successive slices of at most `length` primary keys
    for i in range(0, len(values), length):
        yield values[i:i + length]

es = Elasticsearch()
res = es.search(index="index", body={"query": {"match": ...}})

# The actual hits live under res['hits']['hits']; the PK is assumed to be
# stored in each document's _source.
pks = [hit['_source']['pk'] for hit in res['hits']['hits']]

for chunk_10k in get_chunk(pks, 10000):
    queryset = DjangoModel.objects.filter(pk__in=chunk_10k, **the_rest_of_your_api_filters)
EDIT
To handle a case in which lots and lots of PKs might be found by your Elasticsearch query, you can define a generator that yields successive 10K-row chunks of the results, so you won't exceed your database's query limits and query performance stays reasonable. I've defined it above as a function called get_chunk.
Something like that would also work for alternatives like Redis, MongoDB, etc.

using database routers to shard a table

I am trying to use django's database routers to shard my database, but I am not able to find a solution for that.
I'd like to define two databases, create the same table in both, and then save the even rows in one DB and the odd ones in the other. The examples in the documentation show how to write to a master DB and read from the read-only slaves, which is not what I want, because I don't want to store the whole dataset in both DBs.
Do you know of any webpage explaining what I am trying to do?
Thank you
PS: I am using PostgreSQL and I know there are tools to achieve the same goal at the DB level. My goal is to study whether it can also be done in Django and to explore whether there are advantages to doing this.
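For what it's worth, a router can only make a per-row choice when it receives an instance hint, and an auto-assigned primary key is not known before the first insert, so the sketch below shards on an application-assigned integer field (a hypothetical device_id) rather than on the PK; queryset-level reads still have to pick a shard explicitly with .using():

class ParityShardRouter:
    """Sketch: send even/odd rows of one sharded model to two databases."""

    sharded_models = {"reading"}  # lowercase model names

    def _shard_for(self, instance):
        return "shard_even" if instance.device_id % 2 == 0 else "shard_odd"

    def db_for_write(self, model, **hints):
        if model._meta.model_name in self.sharded_models and hints.get("instance"):
            return self._shard_for(hints["instance"])
        return None

    def db_for_read(self, model, **hints):
        # Only per-instance operations (e.g. related-object access) carry a hint;
        # plain querysets must use Reading.objects.using("shard_even") etc.
        if model._meta.model_name in self.sharded_models and hints.get("instance"):
            return self._shard_for(hints["instance"])
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if model_name in self.sharded_models:
            return db in {"shard_even", "shard_odd"}
        return None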

Warehousing records from a flat item table: Django Signals or PostgreSQL Triggers?

I have a Django website with a PostgreSQL database. There is a Django app and model for a 'flat' item table with many records being inserted regularly, up to millions of inserts per month. I would like to use these records to automatically populate a star schema of fact and dimension tables (initially also modeled in the Django models.py), in order to efficiently do complex queries on the records, and present data from them on the Django site.
Two main options keep coming up:
1) PostgreSQL Triggers: Configure the database directly to insert the appropriate rows into the fact and dimension tables on creation or update of a record, possibly using Python/PL-pgSQL and row-level AFTER triggers. Pros: works with inputs outside Django; might be expected to be more efficient. Cons: splits business logic off to another location; triggered inserts may not be expected by other input sources.
2) Django Signals: Use the Signals feature to do the inserts upon creation or update of a record, with the built-in signal django.db.models.signals.post_save. Pros: easier to build and maintain. Cons: Have to repeat some code or stay inside the Django site/app environment to support new input sources.
Am I correct in thinking that Django's built-in signals are the way to go for maintaining the fact table and the dimension tables? Or is there some other, significant option that is being missed?
I ended up using Django Signals. With a flat table "item_record" containing fields "title" and "description", the code in models.py looks like this:
from django.db.models.signals import post_save

def create_item_record_history(sender, instance, created, **kwargs):
    # Only copy the row into the history table on initial creation
    if created:
        ItemRecordHistory.objects.create(
            title=instance.title,
            description=instance.description,
            created_at=instance.created_at,
        )

post_save.connect(create_item_record_history, sender=ItemRecord)
It is running well for my purposes. Although it's just creating an annotated flat table (new field "created_at"), the same method could be used to build out a star schema.