I'm using Django and Celery for my first web development project, with RabbitMQ as the broker. My Celery workers run on a different system from the web server and execute long-running tasks. During execution, each task writes its output to local log files on the worker. I'd like to display these log files through the web server so the user can see in real time how far the execution has progressed, but I have no idea how to transfer the log files from the workers to the system running the web server. Any suggestions are appreciated.
Don't move the logs; log to the same place from the start. That place can be almost any database (relational or non-relational) that both the web server and the Celery workers can reach. You can even write (or look for) a suitable Python logging handler that saves log records to the centralized storage.
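A minimal sketch of such a handler, assuming a table that both sides can reach. The class name, the table layout, and the read_task_log helper are hypothetical, and the in-memory SQLite connection is only a stand-in for whatever shared database you actually use:

```python
import logging
import sqlite3


class SQLiteLogHandler(logging.Handler):
    """Hypothetical handler that stores each log record as a row in a shared table."""

    def __init__(self, connection, task_id):
        super().__init__()
        self.connection = connection
        self.task_id = task_id
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS task_log ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "task_id TEXT, level TEXT, message TEXT)"
        )

    def emit(self, record):
        try:
            self.connection.execute(
                "INSERT INTO task_log (task_id, level, message) VALUES (?, ?, ?)",
                (self.task_id, record.levelname, self.format(record)),
            )
            self.connection.commit()
        except Exception:
            # Standard logging convention: never let a logging failure kill the task.
            self.handleError(record)


def read_task_log(connection, task_id):
    """What the web server side could run to show the progress of one task."""
    rows = connection.execute(
        "SELECT level, message FROM task_log WHERE task_id = ? ORDER BY id",
        (task_id,),
    )
    return rows.fetchall()


# Worker side: attach the handler to the logger the task writes to.
conn = sqlite3.connect(":memory:")  # stand-in for the shared database
logger = logging.getLogger("tasks.demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(SQLiteLogHandler(conn, task_id="task-1"))
logger.info("step 1 of 3 done")
logger.info("step 2 of 3 done")
```

The web server then only needs a view that calls something like read_task_log for the task the user is watching; nothing has to be copied between machines.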
Maybe the solution isn't to move the logs, but to aggregate them. Take a look at logging tools like Splunk, Loggly, or Logscape.
Related
I am working on a project where all of the DAGs are queued and not moving (for roughly 24 hours or more).
It looks like the scheduler is broken, but I need to confirm that.
So here are my questions:
How can I see whether the scheduler is broken?
How do I reset my Airflow (web server) scheduler?
I'd appreciate some help with resetting Airflow schedulers.
The answer will depend a lot on how you are running Airflow (standalone, in Docker, Astro CLI, managed solution...?).
If your scheduler is broken, the Airflow UI will usually tell you the time since the last heartbeat.
There is also an API endpoint for a scheduler health check at http://localhost:8080/health (if Airflow is running locally).
Check the scheduler logs. By default they are in a file at $AIRFLOW_HOME/logs/scheduler.
You might also want to look at how to do health checks in Airflow in general.
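If you want to script the check against the health endpoint mentioned above, here is a rough sketch. It assumes the endpoint returns JSON with a scheduler section containing a status field, which is the shape I'd expect from Airflow 2.x; treat that shape as an assumption and verify it against your version. The function names and the sample payload are made up for illustration:

```python
import json
import urllib.request


def fetch_health(base_url="http://localhost:8080"):
    """Fetch the health document from the Airflow web server."""
    with urllib.request.urlopen(base_url + "/health", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def scheduler_is_healthy(health):
    """Return True only if the payload reports the scheduler as healthy."""
    scheduler = health.get("scheduler", {})
    return scheduler.get("status") == "healthy"


# Example payload in the assumed shape; the timestamp is invented.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {
        "status": "unhealthy",
        "latest_scheduler_heartbeat": "2023-01-01T00:00:00+00:00",
    },
}
print(scheduler_is_healthy(sample))
```

Against a live instance you would call scheduler_is_healthy(fetch_health()) instead of passing the sample.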
In terms of resetting, it is usually best to restart the scheduler, and again this depends on how you started it in the first place. If you are using a standalone instance and have the processes in the foreground, simply press Ctrl+C or close the terminal to stop it, then start it again. If you are running Airflow in Docker, restart the container; for the Astro CLI there is astro dev restart.
I'm trying to distribute Django and Celery.
I've created a small project with Django and Celery. Django will request a Celery Worker to work on some data on the database. Then the data is passed back to Django.
My idea is that:
1. Django stack installed on one server
2. Message queue (RabbitMQ) on one server
3. Celery worker on one server
Hence 3 servers in total
However, the problem is that Celery has to use some code from Django, for example the models, because it accesses the database through them. Hence, it would also need the settings.py file to know which servers to connect to.
Does this mean that for #3 I would need to install both Django and Celery on the server, but not serve Django and only run Celery? For example, celery -A PROJECT_NAME worker -l INFO, but without an Apache server for Django?
If you want your celery workers to operate on a different server, you need to make sure that all the resources required by the worker are accessible from that server.
For example, if you have a simple task, you can copy only the code required for that task to the server. If your worker needs any other resources, such as other code, files, or a database, you need to make sure it has access to them.
Really, if you want to have two servers working on the same tasks, you will have to use a simple web interface (such as Flask) to communicate between the servers (and extend the functionality of your queue). Then, you will have to ensure they are both using the same data source.
Consider hosting your database remotely, or have the remote server access the database remotely. Either way, any workers running on a server will need access to the database and all source code necessary to complete the task. Then, you must simply have the two servers share a messaging queue.
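One way to picture "same database, same queue" is a settings helper that the web server and the worker both load, with only the environment differing per machine. This is just a sketch: the function name, the environment variable names, the host names, and the PostgreSQL engine are all assumptions for illustration, and CELERY_BROKER_URL is the setting name Celery reads when it is configured from Django settings with the CELERY namespace:

```python
import os


def shared_settings(env=os.environ):
    """Build the settings that both the web server and the worker need.

    The variable names (BROKER_HOST, DB_HOST, ...) are hypothetical.
    """
    broker_host = env.get("BROKER_HOST", "localhost")
    return {
        # Both machines must point at the same RabbitMQ server.
        "CELERY_BROKER_URL": "amqp://guest:guest@%s:5672//" % broker_host,
        # Both machines must also reach the same database.
        "DATABASES": {
            "default": {
                "ENGINE": "django.db.backends.postgresql",
                "NAME": env.get("DB_NAME", "project"),
                "HOST": env.get("DB_HOST", "localhost"),
                "PORT": env.get("DB_PORT", "5432"),
            }
        },
    }


# Same environment on both machines gives identical connection settings.
machine_env = {"BROKER_HOST": "mq.internal", "DB_HOST": "db.internal"}
web = shared_settings(machine_env)
worker = shared_settings(machine_env)
```

The worker machine then has the project code and these settings installed, but only the celery worker process runs there; no web server is needed on it.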
Source: how to configure and run celery worker on remote system
I have never used celery before and I'm also a django newbie so I'm not sure if I should use celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction to the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about building a web GUI for this API with Django.
So, my concern is that, if I use celery, I would have a queue in the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.) but the callback support for state changes is not ready. Here is where I think celery might be appropriate. I would have one or several periodic task(s) monitoring the job states.
Any advice on how to proceed please? No celery at all? celery for everything? celery just for the job states?
I use Celery for a similar purpose and it works well. Basically, I have one node running Celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, and process the results for reporting or for generating dependent tasks.
Each cluster node runs a very small Python server which takes the db id of its assigned job. It then calls into the main (HTTP) server to request the data it needs and finally posts the data back when complete. In my case, the individual nodes don't need to message each other and the run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a celery worker on each node taking tasks directly from the message queue. That approach is appealing. However, I have complex dependencies that are easier to work out from a centralized control. Also, I sometimes need to segment the cluster and centralized control makes this possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for executing tasks that are too expensive to run in the handler of an HTTP request (i.e. a Django view). Consider making an HTTP request from a Django view to some remote web server and think about latencies, possible timeouts, time for data transfer, etc. It also makes sense to queue computation-intensive tasks that take a long time for background execution with Celery.
We can only guess what the web GUI for the API should do. However, Celery fits very well for queuing requests to scientific computation clusters. It also lets you track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, Celery broker (implementing queues for tasks) and worker processes (consuming queues and executing Celery tasks) all on the same server.
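For the periodic state check raised in the question, the body of a scheduled task could be as small as the sketch below. It is plain Python so it runs without Celery; poll_job_states is a hypothetical name, and get_state stands in for whatever call the cluster API offers for querying a job's state:

```python
def poll_job_states(known_states, get_state):
    """Compare stored job states with fresh ones and return the changes.

    known_states: dict mapping job id -> last state seen (updated in place).
    get_state: callable taking a job id and returning its current state;
               a hypothetical stand-in for the remote API call.
    """
    changes = {}
    for job_id, old_state in list(known_states.items()):
        new_state = get_state(job_id)
        if new_state != old_state:
            changes[job_id] = (old_state, new_state)
            known_states[job_id] = new_state
    return changes


# Fake remote states for the demo; a real check would go through the cluster API.
remote = {"job-1": "finished", "job-2": "running"}
known = {"job-1": "running", "job-2": "running"}
changes = poll_job_states(known, remote.get)
```

A periodic Celery task would call something like this on each run and act on the returned changes (update the database, notify the user), which covers the missing callback support without a queue on the clusters themselves.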
I'm having a really hard time conceptualising how I can connect to the Twitter streaming API and process tweets via an admin interface provided by Django.
The main problem is starting a daemon from Django and having the ability to stop/start it, plus making sure there is provision for monitoring. I don't really want to use upstart for this purpose because I want to try and keep the project as self contained as possible.
I'm currently attempting the following and am unsure if it's perhaps the wrong way to go about things
Start a celery task from Django which establishes a persistent connection to the streaming API
The above task creates subtasks which will process tweets and store them
Because celeryd runs as a daemon it will automatically run the first task again if the connection breaks and the task fails - does this mean I don't need any additional monitoring?
Does the above make sense or have I misunderstood how celery works?
After reading a lot of blog posts, I decided to switch from crontab to Celery for my medium-scale Django project. There are a few things I don't understand:
1- I'm planning to start a micro EC2 instance dedicated to RabbitMQ. Would this be sufficient for a small-to-medium task load (such as dispatching periodic e-mails to Amazon SES)?
2- Where does the computation of tasks occur: on the Django server or on the RabbitMQ server (assuming RabbitMQ is on a separate server)?
3- When I need to grow my system and have 2 or more application servers behind a load balancer, do these Celery machines need to connect to the same RabbitMQ vhost? Assume the application servers are carbon copies, the tasks are the same, and everything is in sync at the database level.
I don't know the answer to this question, but you can definitely configure it to be suitable (e.g. use -c1 for a single-process worker to avoid using much memory, or eventlet/gevent pools); see also the --autoscale option. The choice of broker transport also matters here: the ones that do not poll are more CPU-efficient (rabbitmq/redis/beanstalk).
Computing happens on the workers; the broker is only responsible for accepting, routing and delivering messages (and persisting messages to disk when necessary).
To add additional workers, these should indeed connect to the same virtual host. You would only use separate virtual hosts if you wanted applications to have separate message buses.
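To illustrate the division of labour from points 2 and 3, here is a toy model using only the standard library. It is an analogy, not Celery or RabbitMQ code: one queue.Queue plays the role of a single virtual host, and two threads play the role of workers connected to it. All names are made up:

```python
import queue
import threading


def worker(broker, results):
    """Consume messages until a None sentinel arrives; the computing happens here."""
    while True:
        message = broker.get()
        if message is None:
            break
        results.put((message, message * message))


broker = queue.Queue()   # stand-in for one shared virtual host
results = queue.Queue()
workers = [
    threading.Thread(target=worker, args=(broker, results))
    for _ in range(2)
]
for thread in workers:
    thread.start()

# The "application" only publishes messages; it does no computing itself.
for n in range(5):
    broker.put(n)
for _ in workers:
    broker.put(None)  # one stop signal per worker
for thread in workers:
    thread.join()

computed = sorted(results.get() for _ in range(5))
```

Because both workers read from the same queue, each message is handled by exactly one of them, which is the behaviour you get when all workers connect to the same virtual host.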