Where does Django-celery/RabbitMQ store task results?

My celery database backend settings are:
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = "mysqlite.db"
I am using RabbitMQ as my message broker.
It doesn't seem like any results are getting stored in the db, and yet I can read the results after the task is complete. Are they in memory or a RabbitMQ cache?
I haven't tried reading the same result multiple times, so maybe it's read once and then poof!

CELERY_RESULT_DBURI is for the sqlalchemy result backend, not the Django one.
The Django one always uses the default database configured in the DATABASES setting (or the DATABASE_* settings on older Django versions).
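For reference, a minimal sketch of the two configurations, assuming django-celery / old-style Celery settings like the ones quoted above (file names are placeholders; note that CELERY_RESULT_DBURI expects an SQLAlchemy URL, not a bare file name):

# Plain Celery with the SQLAlchemy result backend: CELERY_RESULT_DBURI is used
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = "sqlite:///mysqlite.db"

# django-celery: the "database" backend resolves to the Django ORM backend instead,
# so results land in whatever DATABASES["default"] points at and the DBURI is ignored
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "db.sqlite3",
    }
}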

My celery daemons work just fine, but I'm having difficulties collecting task results. task_result.get() leads to a timeout, and task.state is always PENDING (but the jobs are completed). I tried separate sqlite dbs and a single postgres db shared by the workers, but I still can't get results. CELERY_RESULT_DBURI seems useless to me (for celery 2.5); I think it's a newer configuration option. Any suggestions are welcome...
EDIT: it's all my fault:
I was passing extra parameters to my tasks through the decorators; the ignore_result=True parameter caused this problem. I deleted that key and it works like a charm :)
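For anyone hitting the same thing, a hedged sketch of the difference, using the Celery 2.x-style task decorator (task names are made up):

from celery.task import task

@task(ignore_result=True)   # result is discarded: .get() times out, state stays PENDING
def add_ignored(x, y):
    return x + y

@task                       # result is written to the configured result backend
def add(x, y):
    return x + y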

What is the best way to transfer large files through Django app hosted on HEROKU?

Heroku gives me an H12 error when transferring the file from my Django application to an API (I understand it's a long-running process and there is some memory/worker trade-off involved). I am on a single hobby dyno right now.
The function runs smoothly for files up to around 50 MB. The file itself comes from a different source (via the requests Python package).
The idea is to build a file transfer utility as a Django app on Heroku. The file does not get stored on my app's side; it is just fetched from point A and sent to point B.
I went through multiple discussions along with the standard Heroku documentation, but I am still struggling with some of the concepts:
Will this problem really be solved by background tasks? (If yes, I'm looking for an explanation of the process rather than just the direct way to do it, so that I can optimize my flow.)
As mentioned in the standard docs, they recommend background tasks using the RQ package for Python. I am using PostgreSQL at the moment. Will I need to install and manage a Redis database as well for this? Is this even related to the database?
Some recommend using an extra worker dyno other than the web dyno we have by default. How does this relate to my problem?
Some say to add multiple workers; I'm not sure how this solves it. Let's say it starts working today for large files using background tasks; what if the number of concurrent users increases? How will this impact my solution, and how should I plan mitigation around the risks?
If someone here has a strong understanding of the architecture, I'd like to hear your experiences and thoughts. Also, let me know if there is another option than Heroku from a solution standpoint that would make this easier for me.
Have you looked at using celery to run this as a background task?
This is a very standard way of dealing with requests that take a long time to complete.
Will this problem really be solved by background tasks? (If yes, I'm looking for an explanation of the process rather than just the direct way to do it, so that I can optimize my flow.)
Yes, it can be solved by background tasks. If you are using something like Celery, which has direct support for Django, you will run another instance of your Django application, but with a different startup command for Celery. It then keeps listening for new tasks, reads the next task from the Redis queue (or RabbitMQ, whichever you use as the broker), executes it, and writes the status back to Redis (or whichever result backend you use).
You can also use Flower along with Celery so that you have a dashboard showing how many tasks are being executed and what their statuses are.
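A minimal sketch of the usual wiring, assuming a Django project package called proj (names are placeholders):

# proj/celery.py - the standard Celery/Django integration module
import os
from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "proj.settings")

app = Celery("proj")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

# The worker then runs as a separate process with its own startup command, e.g.:
#   celery -A proj worker --loglevel=info --concurrency=2
# and the optional Flower dashboard:
#   celery -A proj flower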
As mentioned in the standard docs, they recommend background tasks using the RQ package for Python. I am using PostgreSQL at the moment. Will I need to install and manage a Redis database as well for this? Is this even related to the database?
To use background tasks with Celery you will need to set up some sort of message broker like Redis or RabbitMQ; this is separate from your PostgreSQL database.
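With the CELERY settings namespace from the sketch above, the broker boils down to a single setting (URLs are placeholders):

# settings.py - point Celery at the broker; PostgreSQL stays your normal Django database
CELERY_BROKER_URL = "redis://localhost:6379/0"       # or "amqp://guest:guest@localhost:5672//" for RabbitMQ
CELERY_RESULT_BACKEND = "redis://localhost:6379/1"   # optional, only if you need task results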
Some recommend using an extra worker dyno other than the web dyno we have by default. How does this relate to my problem?
I don't think that would help for your use case.
Some say to add multiple workers; I'm not sure how this solves it. Let's say it starts working today for large files using background tasks; what if the number of concurrent users increases? How will this impact my solution, and how should I plan mitigation around the risks?
When you use Celery, you will have to start a few workers for that Celery instance; these workers are the ones that execute your background tasks. The Celery documentation will help you calculate the exact worker count based on your instance's CPU, memory, etc.
If someone here has a strong understanding of the architecture, I'd like to hear your experiences and thoughts. Also, let me know if there is another option than Heroku from a solution standpoint that would make this easier for me.
I have worked on a few projects where we used Celery background tasks to upload large files. It has worked well for our use cases.
Here is my final take on this after a full evaluation, trials, and the earlier recommendations made here. Thanks #arun.
Heroku needs a web dyno to serve the website; a hobby dyno holds 512 MB of memory, and operations you perform below this limit should be fine.
Beyond that, let's say you have a scenario like the one mentioned above, where a large file comes from one source API and goes to another target API through the Django app. You will have to do the following:
First, you will have to run the file transfer function as a background process, since it will take longer than the 30 seconds within which Heroku expects a response; if not, an H12 error is waiting for you. The solution is to implement Django background tasks; Celery worked in my case. Here Celery is your same Django app functionality running as a background handler, which needs its own dyno (the worker). This can be scaled as needed in the future.
To make your Django WSGI (frontend app) talk to Celery (the background app), you need a message broker in between, which can be Heroku Redis, RabbitMQ, etc.
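On Heroku the broker add-on usually injects its URL as a config var, so the broker setting can read it from the environment (the variable name depends on the add-on, e.g. REDIS_URL for Heroku Redis or CLOUDAMQP_URL for CloudAMQP; adjust to what your add-on provides):

import os

# settings.py - use the broker URL provided by the Heroku add-on, with a local fallback
CELERY_BROKER_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")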
Second, the problem doesn't get fully solved just by having a new worker dedicated to the Celery app: the memory limits still apply, since the worker is also a dyno with its own memory.
To overcome this, your requests call should download the file as a stream instead of reading the complete file into a single memory buffer. Iterate over the stream and send the file to the target endpoint in chunks (see the sketch after the list below).
Even the chunk size plays an important role here. I will not put an exact number here, since it depends on various factors:
It should not be too small, or the transfer will take longer.
It should not be too big for either the source or the target endpoint server to handle.
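A minimal sketch of the streaming transfer with requests (URLs and chunk size are placeholders, and the exact upload call depends on what the target API accepts):

import requests

SOURCE_URL = "https://example.com/source/file"    # placeholder
TARGET_URL = "https://example.com/target/upload"  # placeholder
CHUNK_SIZE = 1024 * 1024                          # 1 MB; tune for your endpoints

def stream_transfer():
    # stream=True keeps the response body out of memory until it is iterated
    with requests.get(SOURCE_URL, stream=True) as source:
        source.raise_for_status()
        # Passing a generator as the body makes requests send it chunk by chunk,
        # so the whole file never sits in the dyno's memory at once
        chunks = source.iter_content(chunk_size=CHUNK_SIZE)
        response = requests.post(TARGET_URL, data=chunks)
        response.raise_for_status()
        return response.status_code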

Access to Django ORM from remote Celery worker

I have a Django application and a Celery worker, each running on its own server.
Currently, the Django app uses SQLite to store the data.
I'd like to access the database using Django's ORM from the worker.
Unfortunately, it is not completely clear to me; thus I have some questions.
Is it possible without hacks/workarounds? I'd like a simple solution (I would not like to implement a REST interface for object access). I imagine it could be done if I started using a PostgreSQL instance that is accessible from both servers.
Which project files (there's just Django + a tasks.py file) are required on the worker's machine?
Could you provide me with an example or tutorial? I tried looking it up, but only found tutorials/answers dealing with local Celery workers.
I have been searching for ways to do this simply, but... your best option is to attach a kind of callback to the task function that calls another task on the Django server to carry out the database update (a sketch follows below).
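A hedged sketch of that callback idea, assuming two queues, one consumed by the remote worker and one by a worker running next to the Django app (all names and URLs are made up):

# tasks.py on the remote worker - no Django ORM needed here
from celery import Celery

app = Celery("proj", broker="amqp://broker-host//")   # placeholder broker URL

@app.task
def crunch(data):
    result = do_heavy_work(data)                       # hypothetical helper
    # Hand the result back by task name; the task body only has to be registered
    # on the worker that runs next to the Django app and its database.
    app.send_task("myapp.tasks.save_result", args=[result], queue="django")

# myapp/tasks.py on the Django side - this worker can use the ORM normally
# (start it with: celery -A proj worker -Q django)
@app.task(name="myapp.tasks.save_result")
def save_result(result):
    from myapp.models import Result                    # hypothetical model
    Result.objects.create(payload=result)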

Celery sqlite result database not being created w/ Django, rabbitmq, & sqlalchemy

I have Celery set up with Django. I'm using RabbitMQ as my broker. I'm trying to set up SQLAlchemy as my result backend, with an SQLite database separate from the Django database. I have RabbitMQ, Django, and Celery all running without any issues. I put CELERY_RESULT_BACKEND = 'db+sqlite:///celery_results.sqlite3' in my settings.py, and on Celery worker startup the log output shows the backend configured correctly.
The problem is that my database isn't being created. Why is this happening?
Hopefully this will save someone some head scratching, or from ruining their configuration. Everything was set up correctly and running fine. The issue was that I didn't have any tasks running, since I wanted as little going on as possible during setup. Once I ran a task, the SQLite database and tables were created successfully. I was also able to write a script to confirm that the task results were being stored in the database.
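For reference, a sketch of that kind of confirmation check (table and column names come from Celery's SQLAlchemy backend and may differ between versions):

import sqlite3

# The SQLAlchemy result backend normally creates celery_taskmeta (and
# celery_tasksetmeta) the first time a result is actually stored.
conn = sqlite3.connect("celery_results.sqlite3")
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

for task_id, status, date_done in conn.execute(
    "SELECT task_id, status, date_done FROM celery_taskmeta"
):
    print(task_id, status, date_done)
conn.close()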

Alternative to django-celery to perform asynchronous task in Django?

In my admin I have a form that allows uploading a file to fill the DB.
Parsing and filling the DB take a long time, so I'd like to do it asynchronously.
As recommended by several SO users I tried to install python-celery, but I can't manage to do it (I'm on Webfaction).
Is there any simple, easy-to-install alternative?
If Webfaction supports cron jobs, you can create your own pseudo broker. You could save your long-running tasks to the db in a 'tasks' table; this would allow you to return a response to the user instantaneously. Then a cron job could run very frequently, look for uncompleted tasks, and process them (a rough sketch follows below the links).
I believe this is what django-mailer does:
https://github.com/jtauber/django-mailer/
https://stackoverflow.com/a/1419640/594589
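A rough sketch of that pseudo-broker idea (model name, fields, and the parsing function are made up):

# models.py - the 'tasks' table
from django.db import models

class UploadTask(models.Model):                   # hypothetical model
    file_path = models.CharField(max_length=255)
    completed = models.BooleanField(default=False)
    created_at = models.DateTimeField(auto_now_add=True)

# management/commands/process_tasks.py - run from cron every minute or so
from django.core.management.base import BaseCommand
# from myapp.models import UploadTask  (in a real project this lives in another app)

class Command(BaseCommand):
    help = "Process pending upload tasks"

    def handle(self, *args, **options):
        for upload in UploadTask.objects.filter(completed=False):
            parse_and_fill_db(upload.file_path)   # hypothetical parsing function
            upload.completed = True
            upload.save()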
Try Gearman along with its Python client library.
It's very easy to set up and run Gearman. Try a few examples.
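If you do go the Gearman route, the worker/client pair with the python-gearman library looks roughly like this (python-gearman 2.x style; treat the exact API as something to verify against its docs):

import gearman

# Worker process: registers a task and blocks waiting for jobs
gm_worker = gearman.GearmanWorker(["localhost:4730"])

def task_listener_reverse(worker, job):
    return job.data[::-1]

gm_worker.register_task("reverse", task_listener_reverse)
# gm_worker.work()   # uncomment to start the worker loop

# Client side (e.g. called from the Django admin view): submit a job
gm_client = gearman.GearmanClient(["localhost:4730"])
request = gm_client.submit_job("reverse", "Hello World!")
print(request.result)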

haystack's RealTimeSearchIndex causes django to hang on data entry

I'm using django-haystack and a Xapian backend with real-time indexing (haystack.indexes.RealTimeSearchIndex) of model data, and it works fine on my Ubuntu server. However, it causes Django to hang on data entry when I deploy the app on a RHEL5 server.
Everything is hunky dory if I switch to a standard SearchIndex.
Running ./manage.py rebuild_index manually works fine too.
The major differences between the two setups are the versions of Python (2.4.3 vs 2.6.4) and of Xapian (1.0.4-1 vs 1.0.15).
Any suggestions on what may be the problem?
Nothing interesting appears in the logs, and I've tried different databases (mysql, sqlite3) and deployment methods (mod_python, wsgi) with no luck yet.
I have noted the warning in the Haystack docs stating that RealTimeSearchIndex is only handled gracefully with a Solr backend; however, I'm running a very low-traffic site with only occasional writes, so I'm fine with some CPU overhead on writes.
Installing xapian-core and xapian-bindings from source solved the problem.
I initially used the RPM packages provided here.
Please note this from the author of xapian-haystack:
Because Xapian does not support simultaneous WritableDatabase connections, it is strongly recommended that users take care when using RealTimeSearchIndex to either set WSGIDaemonProcess processes=1 or use some other way of ensuring that there are not multiple attempts to write to the indexes. Alternatively, use SearchIndex and a cronjob to reindex content at set time intervals (sample cronjob can be found here http://gist.github.com/216247) or derive your own SearchIndex to implement some other form of keeping your indexes up to date.