django with heavy computation and long runtime - offline computation and send results - django

I have a django app where the user sends a request, and the server does some SQL lookup, followed by computation on results and finally showing the results to the user.
The SQL lookup and the computation afterwards can take a long time, maybe 30+ minutes. I have seen some webpages ask for email in such cases then send you the URL later. But I'm not sure how this can be done in django or whether there are other options for this situation. Any pointer will be very helpful.
(I'm sorry but as I said it's a rather general question, I don't know how can I provide a min runnable code for this)

One way to accomplish this would be to use something like Celery, which is a distributed task queue. The processing task would go into the queue (synchronously or asynchronously), and it would call a function to send an email to the user alerting them it is ready when the task is complete.
Documentation: https://docs.celeryproject.org/en/stable/django/first-steps-with-django.html

Related

Returning the result of celery task to the client in Django template

So I'm trying to accomplish the following. User browses webpage and at the sime time there is a task running in the background. When the task completes it should return args where one of args is flag: True in order to trigger a javascript and javascript shows a modal form.
I tested it before without async tasks and it works, but now with celery it just stores results in database. I did some research on tornado-celery and related stuff but some of components like tornado-redis is not mantained anymore so it would not be vise in my opinion to use that.
So what are my options, thanks?
If I understand you correctly, then you want to communicate something from the server side back to the client. You generally have three options for that:
1) Make a long pending request to the server - kinda bad. Jumping over the details, it will bog down your web server if not configured to handle that, it will make your site score low on performance tests and if the request fails, everything fails.
2) Poll the server with numerous requests with a time interval (0.2 s, something like that) - better. It will increase the traffic, but the requests will be tiny and will not interfere with the site's performance very much. If you instate a long interval to not load the server with pointless requests, then the users will see the data with a bit of a delay. On the upside this will not fail (if written correctly) even if the connection is interrupted.
3) Websockets where the server can just hit the client with any message whenever needed - nice, but takes some time to get used to. If you want to try, you can use django-channels which is a nice library for Django websockets.
If I did not understand you correctly and this is not the problem at hand and you are figuring how to get data back from a Celery task to Django, then you can store the Celery task ID-s and use the ID-s to first check, if the task is completed and then query the data from Celery.

How to update progress bar while making a Django Rest api request?

My django rest app accepts request to scrape multiple pages for prices & compare them (which takes time ~5 seconds) then returns a list of the prices from each page as a json object.
I want to update the user with the current operation, for example if I scrape 3 pages I want to update the interface like this :
Searching 1/3
Searching 2/3
Searching 3/3
How can I do this?
I am using Angular 2 for my front end but this shouldn't make a big difference as it's a backend issue.
This isn't the only way, but this is how I do this in Django.
Things you'll need
Asynchronous worker procecess
This allows you to do work outside the context of the request-response cycle. The most common are either django-rq or Celery. I'd recommend django-rq for its simplicity, especially if all you're implementing is a progress indicator.
Caching layer (optional)
While you can use the database for persistence in this case, temporary cache key-value stores make more sense here as the progress information is ephemeral. The Memcached backend is built into Django, however I'd recommend switching to Redis as it's more fully featured, super fast, and since it's behind Django's caching abstraction, does not add complexity. (It's also a requirement for using the django-rq worker processes above)
Implementation
Overview
Basically, we're going to send a request to the server to start the async worker, and poll a different progress-indicator endpoint which gives the current status of that worker's progress until it's finished (or failed).
Server side
Refactor the function you'd like to track the progress of into an async task function (using the #job decorator in the case of django-rq)
The initial POST endpoint should first generate a random unique ID to identify the request (possibly with uuid). Then, pass the POST data along with this unique ID to the async function (in django-rq this would look something like function_name.delay(payload, unique_id)). Since this is an async call, the interpreter does not wait for the task to finish and moves on immediately. Return a HttpResponse with a JSON payload that includes the unique ID.
Back in the async function, we need to set the progress using cache. At the very top of the function, we should add a cache.set(unique_id, 0) to show that there is zero progress so far. Using your own math implementation, as the progress approaches 100% completion, change this value to be closer to 1. If for some reason the operation fails, you can set this to -1.
Create a new endpoint to be polled by the browser to check the progress. This looks for a unique_id query parameter and uses this to look up the progress with cache.get(unique_id). Return a JSON object back with the progress amount.
Client side
After sending the POST request for the action and receiving a response, that response should include the unique_id. Immediately start polling the progress endpoint at a regular interval, setting the unique_id as a query parameter. The interval could be something like 1 second using setInterval(), with logic to prevent sending a new request if there is still a pending request.
When the progress received equals to 1 (or -1 for failures), you know the process is finished and you can stop polling
That's it! It's a bit of work just to get progress indicators, but once you've done it once it's much easier to re-use the pattern in other projects.
Another way to do this which I have not explored is via Webhooks / Channels. In this way, polling is not required, and the server simply sends the messages to the client directly.

Sustain an http connection while django processes a big request (20mins+)

I've got a django site that is producing a csv download. The content of the csv is dictated by user defined parameters. It's possible that users will set parameters that require significant thinking time on the server. I need a way of sustaining the http connection so the browser doesn't kick up an error message. I heard that it's possible to send intermittent http headers to do this. Can anyone point me in the right direction to set this up on a django site?
(unfortunatly I'm stuck with the possibility of slow reports - improving my sql won't mitigate this)
Don't do it online. Trigger an offline task, use a bit of Javascript to repeatedly call a view that checks if the task has finished, and redirect to the finished file when it's ready.
Instead of blocking the user and it's browser for 20 minutes (which is not a good idea) do the time-consuming task in the background. When the task will finish and generate the result simply notify the user so that he/she will just need to download the ready result.

Task queue in Django

I've only heard about tools like Celery, but I don't know if it fits my needs and is the best solution I can have.
Imagine a game like Travian. We initiate building and we have to wait N seconds until the construction is finished. When and how should we complete the construction?
Solution 1: Check if there are active construction every time the page loads. If queries like that takes some time we can make them asynchronous. If there are some - then complete.
However, in this way we are constantly waiting for the user to reload the page. Sure, we can use cronjob to check for constructions to be completed from time to time, but cronjobs execute once in a minute or less often. Constructions / attacks etc. must be executed as precisely as possible.
The solution above works, but has some cons. What are better and RELIABLE ways to perform actions like those I mentioned.
Moreover, let's assume that resources needs to be regenerated at X per hour speed and we need to regenerate them very precisely and pretty often. How can I achieve this without waiting for the page to be refreshed?
Finally, solution shall work in Webfaction hosting or any other shared hosting. I've heard that Celery doesn't work in Webfaction or am I mistaken?
Yes, celery have periodic tasks with seconds:
http://celery.readthedocs.org/en/latest/userguide/periodic-tasks.html
Also you can run tasks in time with celery's crontab
http://celery.readthedocs.org/en/latest/userguide/periodic-tasks.html#crontab-schedules
Also if you need to check resources count I think it's common part for every request, so your response should looks like
{
"header": {"resources": {"wood":1, "stone":500}}
"data": {.. you real data shoud be here...}
}
You need to add header to response that will contain common information like resources count, unread messages etc and handle it properly on client.
To improve it you can use nginx + ssl + memcache backend.

Django/Postgres performance worsening after repeatedly processing the same query

I am running Django on Apache. I have several client computers which should call urllib2.urlopen() and send over some data which my server will process and immediately send back a reply. However, when I am testing this I found a very tricky issue. I have one client repeatedly send the same data to be processed. The first time, it takes around ~20 seconds, second time, it takes about 40 seconds, third time I get a 504 (gateway timeout) error. If I try to send data some more 504 errors randomly pop up. I am pretty sure this is an issue with Postgres as the function that processes the information makes many database calls, however, I do not know why the performance of Postgres would decline so much. I have tried several database optimization tricks, including this one: (http://stackoverflow.com/questions/1125504/django-persistent-database-connection), to no avail.
Thanks in advance.
Edit: The requests are not coming concurrently. They are coming in back to back and each query involves a lot of SELECTs and JOINs, and there are a few INSERTs and UPDATEs as well. The apache error logs show that it is just a simple timeout, where the function to process the client posted data takes over 90 seconds.
If it's really Postgres, then you should turn on the logging of slow statements in the Postgres configuration to find out which statement exactly is taking so much time.
This can be done by setting the configuration property log_min_duration.
Details are in the manual:
http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-MIN-DURATION-STATEMENT
You say the function makes "many database calls" so I'd start with a very low number, or even 0 to log the duration of all statements, then you might be able to identify the slow ones.
It could also be a locking issued. Maybe the first call does not end its transaction properly and subsequent calls run into a timeout when waiting for a resource.
You can verify this by checking the system view pg_locks after the first call.
Have you checked the Apache error_logs? Have you set django DEBUG = True or ADMINS = ('email#addr.com',) so you can get a detailed error report about what the actual cause of the issue is? If so, how about pasting some information here.
Why are you certain that it's postgres? Have you done diagnostics to come to that conclusion? If so, please let us know.
Are you running apache with mod_wsgi? How many processes and threads have you allocated to your django application?
Also, 20 seconds to process the first transaction is a huge amount of time. Perhaps you could show us the view code that is causing the time out. We may be able to help there.
I sincerely doubt that it's going to be postgres alone that is causing the issue. It probably has something to do with application code, or server configuration.