How to do "Lazy-write" per-request logging in Django?


Using Django, I need to do some per-request logging that involves database writes.
I understand Django's process_request() and process_response() middleware hooks, but as far as I can tell, those hooks are in the critical path (by design) for rendering the Web page response.
I'd prefer not to have my post-request database write operations holding up the response time for the page.
Is there a simple design pattern for Django that lets me do a "lazy log write", where I can use request hooks to gather information during request processing, but any follow-on operations and the actual log write do not occur until after the response has been written to the user?
I'm using WSGI currently, but would prefer the most general solution possible.

Django implements a request_finished signal, which is fired after the processing of the response has finished. The downside is that it does not give you access to the current request object, which makes it not very useful for logging. The latest place to hook into Django's response processing is most probably the HttpResponse class itself. You could, e.g., store the data temporarily in request.session and write it to the database in the close() method.
But I guess there are also other alternatives you should consider: you could use something like Celery to handle your logging tasks asynchronously. Furthermore, there are non-SQL databases like MongoDB that offer handy and performant features for logging, e.g. you don't have to wait until the changes are really committed to the database, which can give you a big performance advantage.
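
Since the question asks for the most general solution under WSGI, here is a minimal sketch of the "write it in close()" idea done one level below Django, as plain WSGI middleware. PEP 3333 requires the server to call close() on the response iterable once it is done with it, so the slow write can be hung on that call. LazyLogMiddleware, _ClosingIterator and the write_log callable are hypothetical names invented for this illustration, not Django APIs. Note the limits: the write still occupies the worker that served the request, and whether the client has already received the full body when close() runs depends on the server.

```python
import time


class _ClosingIterator:
    """Wraps the response iterable and runs a callback from close()."""

    def __init__(self, iterable, on_close):
        self._iterable = iterable
        self._on_close = on_close

    def __iter__(self):
        return iter(self._iterable)

    def close(self):
        try:
            # Honour the wrapped iterable's own close(), if it has one.
            inner_close = getattr(self._iterable, "close", None)
            if inner_close is not None:
                inner_close()
        finally:
            self._on_close()


class LazyLogMiddleware:
    """Gathers data during the request, writes the log only on close()."""

    def __init__(self, app, write_log):
        self.app = app
        self.write_log = write_log  # hypothetical callable doing the slow write

    def __call__(self, environ, start_response):
        started = time.monotonic()
        record = {
            "path": environ.get("PATH_INFO", ""),
            "method": environ.get("REQUEST_METHOD", ""),
        }

        def capture_status(status, headers, exc_info=None):
            record["status"] = status
            return start_response(status, headers, exc_info)

        body = self.app(environ, capture_status)

        def finish():
            record["duration"] = time.monotonic() - started
            self.write_log(record)

        return _ClosingIterator(body, finish)
```

Wrapping the WSGI application object (for Django, the one returned by get_wsgi_application()) with LazyLogMiddleware would be the assumed way to wire this in. If the write must not occupy the worker at all, the callable can hand the record to a queue instead of writing it directly.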

Related

Django - How to store all the requests/responses with the least overhead?

I'm working on a Django middleware to store all requests/responses in my main database (Postgres / SQLite).
But it's not hard to guess that the overhead will be crazy, so I'm thinking of using Redis to queue the requests for a while and then send them slowly to my database.
e.g. receiving 100 requests, storing them in database, waiting to receive another 100 requests and doing the same, or something like this.
The model is like this:
url
method
status
user
remote_ip
referer
user_agent
user_ip
metadata # any important piece of data related to request/response e.g. errors or ...
created_at
updated_at
My questions are: is this a good approach? How can we implement it? Do you have an example that does such a thing?
And the other question is: is there a better solution?

This doesn't suit the concrete question/answer format particularly well, unfortunately.
"Is this a good approach" is difficult to answer directly with a yes or no response. It will work and your proposed implementation looks sound, but you'll be implementing quite a bit of software and adding quite a bit of complexity to your project.
Whether this is desirable isn't easily answerable without context only you have.
Some things you'll want to answer:
What am I doing with these stored requests? Debugging? Providing an audit trail?
If it's for debugging, what does a database record get us that our web server's request logs do not?
If it's for an audit trail, is every individual HTTP request the best representation of that audit trail? Does an auditor care that someone asked for /favicon.ico? Does it convey the meaning and context they need?
Do we absolutely need every request stored? For how long? How do we handle going over our storage budget? How do we handle edge cases, like the client hanging up before getting the response, or the server processing the request but crashing before sending a response or logging the record?
Does logging a request in band with the request itself present a performance cost we actually can't afford?
Compare the strengths and weaknesses of your approach to some alternatives:
We can rely on the web server's logs, which we're already paying the cost for and are built to handle many of the oddball cases here.
We can write an HTTPLog model in band with the request using a simple middleware function, which avoids some complexities like "what if Redis is down but Django and the database aren't?"
We can write an audit logging system by providing any context needed to an out-of-band process (perhaps through signals or Redis + Celery).
Above all: capture your actual requirements first, implement the simplest solution that works second, and optimize only after you actually see performance issues.
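
For reference, the batching idea from the question (collect N records, write them in one go, repeat) can be sketched without Redis at all. This is only an illustration under assumptions: BatchedRequestLog and the request_log table are made-up names, SQLite stands in for the real database, and only three of the model's fields are stored. Anything still sitting in the buffer is lost if the process crashes, which is exactly the kind of edge case listed above.

```python
import sqlite3
import threading


class BatchedRequestLog:
    """Buffers request records in memory and writes them in batches."""

    def __init__(self, conn, batch_size=100):
        # For use from several threads, the connection would have to be
        # opened with check_same_thread=False.
        self.conn = conn
        self.batch_size = batch_size
        self._buffer = []
        self._lock = threading.Lock()
        conn.execute(
            "CREATE TABLE IF NOT EXISTS request_log "
            "(url TEXT, method TEXT, status INTEGER)"
        )

    def add(self, url, method, status):
        with self._lock:
            self._buffer.append((url, method, status))
            if len(self._buffer) >= self.batch_size:
                self._flush_locked()

    def flush(self):
        """Write whatever is buffered, e.g. at shutdown or on a timer."""
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if not self._buffer:
            return
        with self.conn:  # one transaction per batch
            self.conn.executemany(
                "INSERT INTO request_log VALUES (?, ?, ?)", self._buffer
            )
        self._buffer.clear()
```

A middleware would call add() once per request. The buffer lives inside one process, so each worker process batches on its own; sharing the queue across processes is where something like Redis would come in.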

I would not put this functionality in my Django application. There are many tools to do that. One of them is NGINX, a reverse proxy server which you can put in front of Django. Then you can use the access log from NGINX, and you can format those logs according to your needs. For this much data, it is usually better not to store it in a database, because the data will rarely be used. You can store the logs in an S3 bucket or just in plain files and use a log parser tool to parse them.

Django API beyond simple data handling

I have a django application that deploys the model logic and data handling through the administration.
In the same project I also have a Python file (scriptcl.py) that uses the model data to perform heavy calculations that take some time to process, for example 5 seconds.
I have migrated the project to the cloud, and now I need an API to call this file (scriptcl.py), passing parameters, run the computation according to the parameters and the data in the DB (maintained in the admin), and then respond.
All the examples of Django REST framework (DRF) that I've seen so far only contain authentication and data handling (Create, Read, Update, Delete).
Could anyone suggest an idea to approach this?

In my opinion, the correct approach would be to use Celery to perform these calculations asynchronously.

Write a class that inherits from DRF's APIView and handles authentication, put whatever logic you want in it or call whichever function you need, get the final result and send it back as a JsonResponse. But, as you mentioned, the API may take a long time to respond, in which case you might have to think of something else, such as returning a request_id and having the client hit the server with that request_id every 5 seconds until the data is ready.

Just to give some feedback on this: the approach I took was to build another API using Flask and plain Python scripts.
I also used sqlalchemy to access the database and retrieve the necessary data.

Should I implement revisioning using database triggers or using django-reversion?

We're looking into implementing audit logs in our application and we're not sure how to do it correctly.
I know that django-reversion works and works well but there's a cost of using it.
The web server will have to make two round trips to the database when saving a record, even if the save is in the same transaction, because (at least in Postgres) the changes are written to the database and committing the transaction makes the changes visible.
So this will block the web server until the revision is saved to the database if we're not using async I/O, which is currently the case. Even if we used async I/O, generating the revision's data takes CPU time, which again blocks the web server from handling other requests.
We can use database triggers instead but our DBA claims that offloading this sort of work to the database will use resources that are meant for handling more transactions.
Is using database triggers for this sort of work a bad idea?
We can scale both the web servers using a load balancer and the database using read/write replicas.
Are there any tradeoffs we're missing here?
What would help us decide?

You need to think about the pattern of db usage in your website, which may be unique to you.
However, most web apps read from the db much more often than they write to it. In fact, it's fairly common to see optimisations, done to help a web app scale, that trade more complicated 'save' operations for faster reads. An example would be denormalisation, where some data from related records is copied to the parent record on each save so as to avoid repeatedly doing complicated aggregate/join queries.
This is just an example, but unless you know your specific situation is different I'd say don't worry about doing a bit of extra work on save.
One caveat would be to consider excluding some models from the revisioning system. For example if you are using Django db-backed sessions, the session records are saved on every request. You'd want to avoid doing unnecessary work there.
As for doing it via triggers vs Django app... I think the main considerations here are not to do with performance:
Django app solution is more 'obvious' and 'maintainable'... the app will be in your pip requirements file and Django INSTALLED_APPS, it's obvious to other developers that it's there and working and doesn't need someone to remember to run the custom SQL on the db server when you move to a new server
With a db trigger solution you can be certain it will run whenever a record is changed by any means... whereas with Django app, anyone changing records via a psql console will bypass it. Even in the Django ORM, certain bulk operations bypass the model save method/save signals. Sometimes this is desirable however.
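
To make the "by any means" point concrete, here is a small sketch in which a bulk UPDATE issued as raw SQL, with no ORM involved, still produces one audit row per changed record. SQLite is used only because it ships with Python; the question is about Postgres, where the trigger syntax differs. The article and article_audit tables and the function name are invented for this example.

```python
import sqlite3


def audit_rows_after_bulk_update():
    """Builds a throwaway database and returns the audit rows it collects."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE article_audit (
            article_id INTEGER,
            old_title TEXT,
            new_title TEXT,
            changed_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        -- Fires once per updated row, whoever issues the UPDATE.
        CREATE TRIGGER article_audit_update
        AFTER UPDATE OF title ON article
        BEGIN
            INSERT INTO article_audit (article_id, old_title, new_title)
            VALUES (OLD.id, OLD.title, NEW.title);
        END;
        """
    )
    conn.execute("INSERT INTO article (id, title) VALUES (1, 'draft')")
    conn.execute("INSERT INTO article (id, title) VALUES (2, 'notes')")
    # A bulk change made directly in SQL, bypassing any application code.
    conn.execute("UPDATE article SET title = title || ' v2'")
    rows = conn.execute(
        "SELECT article_id, old_title, new_title "
        "FROM article_audit ORDER BY article_id"
    ).fetchall()
    conn.close()
    return rows
```

An application-level revisioning app would never see that UPDATE, which is the trade-off described in the two points above.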
Another thing I'd point out is that your production webserver will be multiprocess/multithreaded, so although a lengthy db write will block, it will only block the current process (or thread). Your webserver will have other processes which are able to serve other requests concurrently, so it won't block the whole webserver.
So again, unless you have a pattern of usage where you anticipate a high frequency of concurrent writes to the db, I'd say probably don't worry about it.

How can I scale a webapp with long response time, which currently uses django

I am writing a web application with Django on the server side. It takes ~4 seconds for the server to generate a response to the user. It makes use of a weather API, and my application has to make ~50 queries to that API for each user request.
The server side uses Python's urllib to call the weather API. I used Python's threading to speed up the process, because urllib is synchronous. I am using WSGI with Apache. The problem is that the WSGI stack is fully synchronous, and when many users use my application, they have to wait for one another's requests to finish. Since each request takes ~4 seconds, this is unacceptable.
I am kind of stuck, what can I do?
Thanks

If you are using mod_wsgi in a multithreaded configuration, or even a multi process configuration, one request should not block another from being able to do something. They should be able to run concurrently. If using a multithreaded configuration, are you sure that you aren't using some locking mechanism on some resource within your own application which precludes requests running through the same section of code? Another possibility is that you have configured Apache MPM and/or mod_wsgi daemon mode poorly so as to preclude concurrent requests.
Anyway, as mentioned in another answer, you are much better off looking at caching strategies to avoid the weather lookups in the first place, or offloading to client.

50 queries to an outside resource per request is probably a bad place to be, and probably not necessary at all.
The weather doesn't change all that quickly, so you can probably benefit enormously by just caching results for a while. Then it doesn't matter how many requests you're getting; you don't need to do more than a few queries per day.
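
As a rough sketch of what caching for a while means here, the class below remembers each lookup for a fixed time and only calls the API again once the entry has expired. TTLCache and its fetch argument are hypothetical names; in a Django project the built-in cache framework would be the more natural home for this, and the sketch is not safe against two threads fetching the same key at once.

```python
import time


class TTLCache:
    """Keeps each fetched value for ttl_seconds, then fetches it again."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: no outside call
        value = fetch(key)       # e.g. the urllib call to the weather API
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

With a TTL of an hour or so, 50 lookups for the same locations cost 50 outside calls once, and dictionary reads for every request after that until the entries expire.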
If that's not your situation, you might be able to get the client to do the work for you. Refactor the code so that the weather API aggregation happens on the client in JavaScript, rather than funneling it all through the server.
Edit: based on the comments you've posted, what you are asking for probably cannot be optimized within the constraints of the API you are using. The problem is that the service is doing a good job of abstracting away the differences between the many sources of weather information it aggregates into a nearest-location query. After all, weather stations provide only point data.
If you talk directly to the technical support people that provide the API, you might find that they are willing to support more complex queries (bounding box), for which they will give you instructions. More likely, though, they abstract that away because they don't want to actually reveal the resolution that their API actually provides, or because there is some technical reason in the way that they model their data or perform their calculations that would make such queries too difficult to support.
Without that or caching, you are just out of luck.

Asynchronous database update in Django?

I have a big form on my site. When the users fill it out and submit it, most of the data just gets dumped to the database, and then they get redirected to a new page. However, I'd also like to use the data to query another site, and then parse the results. That might take a bit longer. It's not essential that the user sees these results right away, so I was wondering if it's possible to asynchronously call a function that will handle this, and then return an HttpResponse from my view like usual without making them wait?
If so... how? Any particular libraries I should look at?

Use RabbitMQ and Celery with Django. If you are deployed on EC2, also look at SQS.
You create a message during the request-response cycle, and a separate process or a cron job keeps working through the messages.
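
The shape of that pattern, stripped of any broker, looks roughly like this. The standard-library queue and a thread stand in for RabbitMQ and a Celery worker purely for illustration, and handle_form, worker and process_submission are hypothetical names. Unlike a real broker, an in-process queue loses its messages if the process dies.

```python
import queue
import threading


def process_submission(form_data):
    # Stand-in for querying the other site and parsing its results.
    return {"name": form_data["name"].upper()}


def worker(task_queue, results):
    """Runs outside the request-response cycle, one message at a time."""
    while True:
        message = task_queue.get()
        try:
            if message is None:  # sentinel: shut the worker down
                return
            results.append(process_submission(message))
        finally:
            task_queue.task_done()


def handle_form(form_data, task_queue):
    """What the view does: enqueue the slow part and respond right away."""
    task_queue.put(form_data)
    return "redirect to next page"
```

The view returns as soon as the message is queued; the user never waits for process_submission.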
