Collecting Relational Data and Adding to a Database Periodically with Python - django

I have a project that:
fetches data from Active Directory
fetches data from different services based on the Active Directory data
aggregates the data
About 50,000 rows have to be added to the database every 15 minutes.
I'm using PostgreSQL as the database and Django as the ORM. But I'm not sure Django is the right tool for such a project: I have to drop and re-insert 50,000 rows each cycle, and I'm worried about performance.
Is there another way to do such a process?

50k rows/15m is nothing to worry about.
But I'd make sure to use bulk_create to avoid 50k round trips to the database, which might be a problem depending on your database networking setup.
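As a rough sketch of what that looks like (assuming a hypothetical Record model for the aggregated rows and a list of dicts produced by your aggregation step):

    from myapp.models import Record  # hypothetical model holding one aggregated row

    def load_rows(rows):
        """Insert all aggregated rows with a handful of INSERT statements."""
        objs = [Record(**row) for row in rows]  # rows: list of dicts
        # batch_size keeps individual INSERT statements at a manageable size
        Record.objects.bulk_create(objs, batch_size=1000)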

For sure there are other ways, if that's what you're asking. But the Django ORM is quite flexible overall, and if you write your queries carefully there will be no significant overhead. 50,000 rows every 15 minutes is not really that big. I'm using the Django ORM with PostgreSQL to process millions of records a day.

You can write a custom Django management command for this purpose, then call it like:
python manage.py collectdata
Here is the documentation link
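A minimal sketch of such a command (assuming an app named myapp and the same hypothetical Record model; the actual fetch/aggregate logic is left out):

    # myapp/management/commands/collectdata.py
    from django.core.management.base import BaseCommand
    from django.db import transaction

    from myapp.models import Record  # hypothetical model


    class Command(BaseCommand):
        help = "Fetch, aggregate and reload the data set"

        def handle(self, *args, **options):
            rows = self.collect()  # fetch from AD and the other services (not shown)
            with transaction.atomic():
                # swap the previous snapshot for the new one atomically
                Record.objects.all().delete()
                Record.objects.bulk_create(
                    [Record(**row) for row in rows], batch_size=1000
                )
            self.stdout.write(self.style.SUCCESS("Loaded %d rows" % len(rows)))

        def collect(self):
            # placeholder for the Active Directory / service calls
            return []

Running it every 15 minutes is then just a matter of pointing cron or Celery beat at python manage.py collectdata.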

Related

What common approaches do I have to scaling data imports into a DB where the table structure is defined by Django ORM?

In the current project I'm working on, we have a monolithic Django web app consisting of multiple Django "apps", each with many models, with the Django ORM defining the table layout for a single-instance Postgres database (RDS).
On a semi-regular basis we need to do large imports into the DB, hundreds of thousands of rows of inserts and updates, for which we use the Django ORM models from Jupyter because of the ease of use. Django models keep the code simple, and we have a lot of complex table relationships and Celery tasks that are driven by write events.
Edit: We batch writes to the DB on import with bulk_create or by using transactions where it's useful to do so.
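For illustration, that batching amounts to something like this (a sketch, assuming a hypothetical ImportedRow model and an iterable of already-parsed records):

    from itertools import islice

    from django.db import transaction

    from myapp.models import ImportedRow  # hypothetical model


    def import_records(records, chunk_size=5000):
        """Write a large iterable of parsed records in transactional chunks."""
        it = iter(records)
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            with transaction.atomic():
                ImportedRow.objects.bulk_create(
                    [ImportedRow(**r) for r in chunk], batch_size=1000
                )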
These imports have grown and now cause performance degradation, or get rate limited and take weeks, by which time the data is worth a lot less. I've optimized prefetching and queries as much as possible, and testing is fairly tight around this. The AWS dashboard tells me the instance is running really hot during these imports, but then goes back to normal afterwards, as you would expect.
At other places I've worked, there was a separate store that all the ETL output got transformed into, and then some reconciliation process, either streaming or at a quiet hour. I don't understand how to achieve this cleanly when Django is in control of the table structures.
How do I achieve a scenario where:
Importing data triggers all the actions that a normal DjangoORM write would
Importing data doesn't degrade performance or take forever to complete
Is maintainable and easy to use
Any reading material or links on the topic would be amazing; I'm finding it difficult to find examples of people scaling out of this stage of Django.

The best way to integrate Django and Scrapy

I know of some options like scrapy-djangoitem, but as its documentation mentions:
DjangoItem is a rather convenient way to integrate Scrapy projects with Django models, but bear in mind that Django ORM may not scale well if you scrape a lot of items (ie. millions) with Scrapy. This is because a relational backend is often not a good choice for a write intensive applications (such as a web crawler), specially if the database is highly normalized and with many indices.
So what is the best way to store scraped items in the database through Django models?
It is not about the Django ORM but rather about the database you choose as the backend. What it says is that if you expect to write millions of items to your tables, relational database systems might not be your best choice (MySQL, Postgres, ...), and it can be even worse in terms of performance if you add many indices, since your application is write-heavy (the database must update B-trees or other index structures on every write).
I would suggest sticking with Postgres or MySQL for now and looking for another solution if you start to have performance issues at the database level.
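If you do stay on Postgres/MySQL with Django models, one way to soften the write pressure is a Scrapy item pipeline that buffers items and flushes them in batches (a sketch, assuming your Django settings are importable from the Scrapy process and a hypothetical Article model):

    # pipelines.py -- DJANGO_SETTINGS_MODULE must be set before Scrapy starts
    import django

    django.setup()

    from myapp.models import Article  # hypothetical Django model


    class DjangoBulkPipeline:
        """Buffer scraped items and write them in batches instead of one by one."""

        def __init__(self, batch_size=500):
            self.batch_size = batch_size
            self.buffer = []

        def process_item(self, item, spider):
            self.buffer.append(Article(**dict(item)))
            if len(self.buffer) >= self.batch_size:
                self.flush()
            return item

        def close_spider(self, spider):
            self.flush()

        def flush(self):
            if self.buffer:
                Article.objects.bulk_create(self.buffer, ignore_conflicts=True)
                self.buffer = []

Enable it through ITEM_PIPELINES in the Scrapy settings as usual.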

How to move from one database backend to another on a production Django project?

I would like to move a database in a Django project from one backend to another (in this case Azure SQL to PostgreSQL, but I want to think of it as a generic situation). I can't use a plain dump since the databases are different.
I was thinking of something at the Django level, like dumpdata, but depending on the amount of available memory and the size of the DB it sometimes appears unreliable and crashes.
I have seen solutions that try to break the process into smaller parts that the memory can handle, but they are a few years old, so I was hoping to find other solutions.
So far my searches have failed, since they always lead to 'south', which deals with schema migrations and not with moving data.
I have not implemented this before, but what about the following:
Django supports multiple databases, so just configure DATABASES in your settings file with both the old Azure SQL database and the new PostgreSQL database. Then write a small script that uses bulk_create, reading the data from one DB and writing it to the other.
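A sketch of that idea, assuming the old database is reachable under the alias "old" and the new PostgreSQL database is "default" (the model name, aliases and the Azure SQL engine string are placeholders; large schemas would need to be copied model by model in dependency order):

    # settings.py (excerpt): both databases configured side by side
    DATABASES = {
        "default": {  # the new PostgreSQL database
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "newdb",
            # ...
        },
        "old": {  # the database being migrated away from
            "ENGINE": "mssql",  # whichever third-party Azure SQL backend you already use
            "NAME": "olddb",
            # ...
        },
    }

    # copy_data.py -- e.g. run from a management command or `manage.py shell`
    from myapp.models import MyModel  # repeat for each model, in dependency order

    BATCH = 1000
    buffer = []
    for obj in MyModel.objects.using("old").iterator(chunk_size=BATCH):
        buffer.append(obj)  # primary keys are kept, so foreign keys stay valid
        if len(buffer) >= BATCH:
            MyModel.objects.using("default").bulk_create(buffer)
            buffer = []
    if buffer:
        MyModel.objects.using("default").bulk_create(buffer)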

How to cache MySQL table in C++ web service

I've got a big users table that I'm caching in a C++ web service (BitTorrent tracker). The entire table is refetched every 5 minutes. This has some drawbacks, like data being up to 5 minutes old and refetching lots of data that hasn't changed.
Is there a simple way to fetch just the changes since last time?
Ideally I'd not have to change the queries that update the data.
Two immediate possibilities come to me:
MySQL Query Cache
Memcached (or similar) Caching Layer
I would try the query cache first as it is likely far easier to set up. Do some basic tests/benchmarks and see if it fits your needs. Memcached will likely be very similar to your existing cache but, as you mention, you'll need to find a better way of invalidating stale cache entries (something that the query cache does for you).

Django slow queries: Connect django filter statements to slow queries in database logs

If you are trying to diagnose slow queries in your mysql backend and are using a Django frontend, how do you tie together the slow queries reported by the backend with specific querysets in the Django frontend code?
I think you have no alternative besides logging every Django query for the suspicious querysets.
See this answer on how to access the actual query for a given queryset.
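For reference, a queryset exposes (roughly) the SQL it will run, which you can log next to the suspicious code (a sketch with a hypothetical MyModel; note that str(qs.query) interpolates parameters without proper quoting, so it is only an approximation of the real statement):

    import logging

    from myapp.models import MyModel  # hypothetical model

    logger = logging.getLogger(__name__)

    qs = MyModel.objects.filter(active=True).select_related("owner")
    logger.debug("About to run: %s", str(qs.query))  # roughly the SQL Django will send

Turning on the django.db.backends logger at DEBUG level (with settings.DEBUG enabled) also logs every executed statement with its timing, which makes it easier to line Django code up against the MySQL slow query log.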
If you install django-devserver, it will show you the queries that are being run and the time they take in your shell when using runserver.
Another alternative is django-debug-toolbar, which will do the same in a side-panel overlay on your site.
Either way, you'll need to test it out in your development environment. However, neither really solves the issue of pinpointing the offending queries directly; they work on a per-request basis. As a result, you'll have to do a little thinking about which of your views use the database most heavily and/or deal with exceptionally large amounts of data, but by cherry-picking likely candidate views and inspecting the query times on those pages, you should be able to get a handle on which particular queries are the worst.