Django: publish to Kafka instead of saving to the database

We are trying to develop a microservice architecture in Django REST framework where all writes to a SQL database are done by publishing messages to topics on an Apache Kafka messaging system, and all reads are done directly from the SQL database. This is being done for several reasons, but most importantly it's to have an event log history through Kafka, and to avoid delays due to slower write cycles (reads from databases are usually much faster than writes).
Using signals, it's quite easy to call a Kafka producer method in Django which will publish an event on Kafka containing the model data to be written to the database. However, we don't know how to stop the save() method from writing anything to the database. We don't want to completely override the save() method, as that may cause issues when dealing with ManyToMany relationships.
Does anyone have a simple suggestion for avoiding database writes while still doing database reads in Django? Please keep in mind that we still want to use the conventional method of reading from the SQL database through Django models.
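For illustration, here is a minimal sketch of the signal-based producer mentioned in the question, assuming the confluent-kafka client and a hypothetical Order model and "orders" topic. It shows how the event is published; it does not by itself prevent the ORM write.

```python
# Hypothetical sketch: publish model data to Kafka from a pre_save signal.
# Assumes the confluent-kafka package; model, topic and field names are illustrative.
import json

from confluent_kafka import Producer
from django.db.models.signals import pre_save
from django.dispatch import receiver

from myapp.models import Order  # hypothetical model

producer = Producer({"bootstrap.servers": "localhost:9092"})


@receiver(pre_save, sender=Order)
def publish_order_event(sender, instance, **kwargs):
    event = {
        "model": "Order",
        "fields": {"id": instance.pk, "status": instance.status},
    }
    producer.produce("orders", json.dumps(event).encode("utf-8"))
    producer.flush()
```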

Related

What common approaches do I have to scaling data imports into a DB where the table structure is defined by Django ORM?

In the current project I'm working on, we have a monolithic Django webapp consisting of multiple Django "apps", each with many models, and the Django ORM defining the table layout for a single-instance Postgres database (RDS).
On a semi-regular basis we need to do some large imports into the DB, hundreds of thousands of rows of inserts and updates, for which we use the Django ORM models in Jupyter because of their ease of use. Django models keep the code simple, and we have a lot of complex table relationships and Celery tasks that are driven by write events.
Edit: We batch writes to the DB on import with bulk_create or by using transactions where it's useful to do so.
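For reference, the kind of batching described in the edit might look roughly like this (model and field names are illustrative; batch_size and the atomic block are the main levers). Note that bulk_create bypasses save() and the save signals, which matters if write events drive Celery tasks.

```python
# Rough sketch of batched imports with bulk_create inside a transaction.
# Model and field names are hypothetical.
from django.db import transaction

from myapp.models import Reading  # hypothetical model


def import_rows(rows, batch_size=1000):
    objs = [Reading(sensor_id=r["sensor"], value=r["value"]) for r in rows]
    with transaction.atomic():
        # batch_size keeps each INSERT statement to a manageable size;
        # note: bulk_create skips save() and post_save, so signal-driven
        # tasks will not fire for these rows.
        Reading.objects.bulk_create(objs, batch_size=batch_size)
```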
These imports have grown; they cause performance degradation, or get rate limited and take weeks, by which time the data is worth a lot less. I've optimized prefetching and queries as much as possible, and testing is fairly tight around this. The AWS dashboard tells me the instance is running really hot during these imports, then returns to normal afterwards, as you would expect.
At other places I've worked, there's been a separate store that all the ETL output gets transformed into, and then some reconciliation process that runs either as a stream or at a quiet hour. I don't understand how to achieve this cleanly when Django is in control of the table structures.
How do I achieve a scenario where:
Importing data triggers all the actions that a normal Django ORM write would
Importing data doesn't degrade performance or take forever to complete
The import process is maintainable and easy to use
Any reading material or links on the topic would be amazing; I'm finding it difficult to find examples of people scaling out of this stage of Django.

Should I implement revisioning using database triggers or using django-reversion?

We're looking into implementing audit logs in our application and we're not sure how to do it correctly.
I know that django-reversion works and works well but there's a cost of using it.
The web server will have to make two roundtrips to the database when saving a record, even if the save is in the same transaction, because (at least in Postgres) the changes are written to the database and committing the transaction is what makes them visible.
So this will block the web server until the revision is saved to the database if we're not using async I/O, which is currently the case. Even if we were using async I/O, generating the revision's data takes CPU time, which again blocks the web server from handling other requests.
We can use database triggers instead but our DBA claims that offloading this sort of work to the database will use resources that are meant for handling more transactions.
Is using database triggers for this sort of work a bad idea?
We can scale both the web servers using a load balancer and the database using read/write replicas.
Are there any tradeoffs we're missing here?
What would help us decide?
You need to think about the pattern of db usage in your website.
That pattern may be unique to you; however, most web apps read from the db much more often than they write to it. In fact, it's fairly common to see optimisations done to help scale a web app which trade off more complicated 'save' operations for faster reads. An example would be denormalisation, where some data from related records is copied to the parent record on each save, so as to avoid repeatedly doing complicated aggregate/join queries.
This is just an example, but unless you know your specific situation is different I'd say don't worry about doing a bit of extra work on save.
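To make the denormalisation example concrete, here is a rough sketch with hypothetical models: an aggregate is maintained on the parent at save time so that reads can skip the join.

```python
# Hypothetical denormalisation: keep a comment_count on the parent so list
# pages can avoid an aggregate/join query on every read.
from django.db import models
from django.db.models import F


class Post(models.Model):
    title = models.CharField(max_length=200)
    comment_count = models.PositiveIntegerField(default=0)  # denormalised field


class Comment(models.Model):
    post = models.ForeignKey(Post, on_delete=models.CASCADE)
    body = models.TextField()

    def save(self, *args, **kwargs):
        created = self.pk is None
        super().save(*args, **kwargs)
        if created:
            # A bit of extra work on save, so reads can skip the COUNT(*).
            Post.objects.filter(pk=self.post_id).update(
                comment_count=F("comment_count") + 1
            )
```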
One caveat would be to consider excluding some models from the revisioning system. For example if you are using Django db-backed sessions, the session records are saved on every request. You'd want to avoid doing unnecessary work there.
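With django-reversion, versioning is opt-in per model, so "excluding" noisy models mostly means only registering the ones you actually want audited. A rough sketch, assuming django-reversion's register/create_revision API and a hypothetical Invoice model:

```python
# Sketch: only registered models get revisions, so high-churn models
# (e.g. db-backed sessions) are simply left unregistered.
import reversion
from django.db import models


@reversion.register()
class Invoice(models.Model):  # hypothetical model worth auditing
    number = models.CharField(max_length=32)
    total = models.DecimalField(max_digits=10, decimal_places=2)


# Somewhere in a view or service function:
def update_invoice(invoice, total):
    with reversion.create_revision():
        invoice.total = total
        invoice.save()
        reversion.set_comment("Total adjusted")
```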
As for doing it via triggers vs Django app... I think the main considerations here are not to do with performance:
The Django app solution is more 'obvious' and 'maintainable'... the app will be in your pip requirements file and in Django's INSTALLED_APPS, so it's obvious to other developers that it's there and working, and it doesn't require someone to remember to run custom SQL on the db server when you move to a new server.
With a db trigger solution you can be certain it will run whenever a record is changed by any means... whereas with a Django app, anyone changing records via a psql console will bypass it. Even within the Django ORM, certain bulk operations bypass the model save method and the save signals (sometimes this is desirable, however).
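For example (model name is illustrative), these common ORM calls skip save() and the save signals, so an app-level audit hook would miss them:

```python
# Illustrative: bulk ORM operations bypass Model.save() and the save signals,
# so a signal- or save()-based audit trail will not record these changes.
from myapp.models import Invoice  # hypothetical model

# update() issues a single UPDATE; no save(), no pre_save/post_save signals.
Invoice.objects.filter(paid=False).update(paid=True)

# bulk_create() inserts rows without sending pre_save/post_save per object.
Invoice.objects.bulk_create([Invoice(number="A-1"), Invoice(number="A-2")])
```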
Another thing I'd point out is that your production webserver will be multi-process/multi-threaded... so although, yes, a lengthy db write will block the webserver, it will only block the current process. Your webserver will have other processes which are able to serve other requests concurrently. So it won't block the whole webserver.
So again, unless you have a pattern of usage where you anticipate a high frequency of concurrent writes to the db, I'd say probably don't worry about it.

Using Redis as intermediary cache for REST API

We have an iOS app that talks to a Django server via a REST API. Most of the data consists of rather large Item objects that involve a few related models and render into a single flat dictionary, and this data changes rarely.
We've found that querying this is not a problem for Postgres, but generating JSON responses takes a noticeable amount of time. On the other hand, item collections vary per user.
I thought about a rendering system where we just build a dictionary for each Item object and save it into Redis as a JSON string. That way we can serve the API directly from Redis (e.g. HMGET with the ids of the items in the user's library), which is fast, and it's relatively easy to regenerate the "rendered instances": basically just a couple of post_save signals.
I wonder how good this design is; are there any major flaws in it? Maybe there's a better way to approach the task?
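As a rough sketch of the design described above (redis-py and a hypothetical Item model assumed), the rendered JSON could live in a Redis hash keyed by item id and be refreshed from a post_save signal:

```python
# Sketch of the "rendered instance" cache described in the question.
# Assumes redis-py; model, field and key names are illustrative.
import json

import redis
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Item  # hypothetical model

r = redis.Redis(host="localhost", port=6379)


def render_item(item):
    # Flatten the item and its related models into one dictionary.
    return {"id": item.pk, "name": item.name, "category": item.category.name}


@receiver(post_save, sender=Item)
def refresh_rendered_item(sender, instance, **kwargs):
    r.hset("items:rendered", instance.pk, json.dumps(render_item(instance)))


def get_rendered_items(item_ids):
    # Serve the API straight from Redis: one HMGET for the user's library.
    raw = r.hmget("items:rendered", item_ids)
    return [json.loads(x) for x in raw if x is not None]
```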
Sure, we do the same at our firm, using Redis to store not JSON but large XML strings which are generated from backend databases for RESTful requests, and it saves lots of network hops and overhead.
A few things to keep in mind if this is the first time you're using Redis...
Dedicated Redis Server
Redis is single-threaded and should be deployed on a dedicated server with sufficient CPU power. Don't make the mistake of deploying it on your app or database server.
High Availability
Set up Redis with Master/Slave replication for high availability. I know there's been lots of progress with Redis cluster, so you may want to check on that too for HA.
Cache Hit/Miss
When checking Redis for a cache "hit", if the connection is dead or any exception occurs, don't fail the request, just default to the database; caching should always be 'best effort' since the database can always be used as a last resort.
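A sketch of that 'best effort' lookup (redis-py assumed; model and key names are hypothetical), where any Redis failure falls back to the database:

```python
# Best-effort cache read: a miss or any Redis error falls back to the database.
import json
import logging

import redis

from myapp.models import Item  # hypothetical model

log = logging.getLogger(__name__)
r = redis.Redis(host="localhost", port=6379)


def get_item(item_id):
    try:
        cached = r.hget("items:rendered", item_id)
        if cached is not None:
            return json.loads(cached)  # cache hit
    except redis.RedisError:
        log.warning("Redis unavailable, falling back to the database")
    # Cache miss or Redis error: the database is the source of truth.
    item = Item.objects.get(pk=item_id)
    return {"id": item.pk, "name": item.name}
```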

Retaining state between Django views

As a little backstory, I'm working on an application which pipes KML to Google Earth based on packet data from a mesh network. Example:
UDP Packet ---> Django ORM to place organized data in DB ---> Django view to read the DB and return a KML representation of the packet data (gps, connections, etc) to Google Earth.
The problem here is that the DB rows tell a story, and doing a query, or a series of queries, isn't enough to "paint a picture" of this mesh network. I need to retain some internal Python structures and classes to maintain a "state" of the network between requests/responses.
Here is where I need help. Currently, to retain this "state", I use Django's low-level cache API to store a class with an unlimited timeout. On every request, I just retrieve that class from the cache, add to its structures, and save it back to the cache. This seems to be working, and pretty well actually, but it doesn't feel right.
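Concretely, that pattern looks something like the following minimal sketch (NetworkState and its method are hypothetical stand-ins for the internal structures; timeout=None means the entry never expires):

```python
# Minimal sketch of the pattern described above: pull the state object out of
# the cache, mutate it, and write it back on every request.
from django.core.cache import cache


class NetworkState:
    """Hypothetical container for the mesh-network structures."""

    def __init__(self):
        self.packets = []

    def record_packet(self, packet):
        self.packets.append(packet)


def update_network_state(packet):
    # Retrieve, mutate, and re-save the state on every request.
    state = cache.get("mesh_network_state") or NetworkState()
    state.record_packet(packet)
    cache.set("mesh_network_state", state, timeout=None)  # timeout=None: never expire
```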
Maybe I should ditch Django and extend Python's BaseHTTP class to handle my requests/responses?
Maybe I should create a separate application to retain the "state" and Django pipes it request data through a socket?
I just feel like I'm misusing Django and being unsafe with crucial data. Any help?
I know this is unconventional and a little crazy.
(Note: I'm currently using Django's ORM outside of a Django instance for the UDP socket listener, so I am aware I can use Django's environment outside of an instance.)
Maybe I should ditch Django and extend Python's BaseHTTP class to handle my requests/responses?
Ditching Django for Python's BaseHTTP won't change the fact that HTTP is a stateless protocol and you want to add state to it. You are correct that storing state in the cache is somewhat volatile depending on the cache backend. It's possible you could switch this to the session instead of the cache.
Maybe I should create a separate application to retain the "state" and Django pipes it request data through a socket?
Yes this seems like a viable option. Again HTTP is stateless so if you want state you need to persist it somewhere and the DB is another place you could store this.
This really sounds like the kind of storage problem Redis and MongoDB are made to efficiently handle. You should be able to find a suitable data structure to keep track of your packet data and matching support for creating cheap, atomic updates to boot.

How to do "Lazy-write" per-request logging in Django?

Using Django, I need to do some per-request logging that involves database writes.
I understand Django's process_request() and process_response() middleware hooks, but as far as I can tell, those hooks are in the critical path (by design) for rendering the Web page response.
I'd prefer not to have my post-request database write operations holding up the response time for the page.
Is there a simple design pattern for Django that lets me do a "lazy log write", where I can use request hooks to gather information during request processing, but any follow-on operations and the actual log write operation does not occur until after the response is written to the user?
I'm using WSGI currently, but would prefer the most general solution possible.
Django implements a request_finished signal which is fired after the processing of the response has finished, but the downside is that it does not give you access to the current request object, which makes it not very useful for logging... The latest place to hook into Django's response processing is most probably the HttpResponse class itself. You could, e.g., store the data temporarily in request.session and write it to the database in the close() method.
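A rough sketch of that close() idea: a response subclass that defers the log write until the WSGI layer closes the response (here the data is attached to the response directly rather than stashed in request.session; RequestLog is a hypothetical model).

```python
# Sketch: defer the log write to HttpResponse.close(), which the WSGI handler
# calls after the response has been sent. Names are illustrative.
from django.http import HttpResponse

from myapp.models import RequestLog  # hypothetical model


class LoggedResponse(HttpResponse):
    def __init__(self, *args, log_data=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.log_data = log_data or {}

    def close(self):
        # Runs after the response body has been handed to the client.
        super().close()
        RequestLog.objects.create(**self.log_data)


def my_view(request):
    return LoggedResponse("OK", log_data={"path": request.path})
```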
But I guess there are also other alternatives you should consider: you could use something like Celery to handle your logging tasks asynchronously. Furthermore, there are non-SQL databases like MongoDB that offer handy and performant features for logging, e.g. you don't have to wait until the changes are really committed to the database, which can give you a big performance advantage.
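If you go the Celery route, a minimal sketch might look like this (task and model names are hypothetical): the web process only enqueues a message, and a worker does the database write.

```python
# Sketch: hand the log write off to a Celery worker so the request path
# only pays the cost of enqueueing. Names are illustrative.
from celery import shared_task

from myapp.models import RequestLog  # hypothetical model


@shared_task
def write_request_log(path, status_code, duration_ms):
    RequestLog.objects.create(
        path=path, status_code=status_code, duration_ms=duration_ms
    )


# In middleware or a view, enqueue instead of writing synchronously:
# write_request_log.delay(request.path, response.status_code, duration_ms)
```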