Running multiple schedulers in Airflow 2.0 on same machine - airflow-scheduler

I understand that Airflow 2.0 now supports running multiple schedulers concurrently for HA. Can I run multiple schedulers on the same machine (VM)? Do I do it just by running airflow scheduler twice if I want 2 schedulers, without configuring anything else?

Can I run multiple schedulers on the same machine (VM)?
Yes, you can. The second scheduler can be started like this:
airflow scheduler --skip-serve-logs
Can I run multiple schedulers on different machines (VMs)?
Yes, you can. I created a second VM to run multiple schedulers.
Check these dependencies first:
Set the use_row_level_locking config option to True (the default is True).
Check your backend database's version: multiple schedulers rely on row-level locking (SKIP LOCKED or NOWAIT), which needs a recent database such as PostgreSQL 9.6+ or MySQL 8+.
Make sure every scheduler's database config points to the same database.
Once these dependencies are satisfied, you can run multiple schedulers on different machines.
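As a rough way to verify the config-related dependencies above, here is a minimal sketch that compares the contents of several airflow.cfg files. The helper name check_scheduler_configs is made up for illustration, and the section names are an assumption: Airflow 2.0 keeps sql_alchemy_conn under [core], while later releases use [database], so the sketch looks in both. It does not check the database version.

```python
import configparser


def check_scheduler_configs(*config_texts: str) -> list[str]:
    """Return a list of problems found across several airflow.cfg contents."""
    problems = []
    connections = set()
    for index, text in enumerate(config_texts, start=1):
        # Disable interpolation because connection strings may contain '%'.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(text)
        # Assumption: the connection string lives under [database] or [core].
        conn = parser.get("database", "sql_alchemy_conn", fallback=None) or parser.get(
            "core", "sql_alchemy_conn", fallback=None
        )
        if conn is None:
            problems.append(f"config {index}: no sql_alchemy_conn found")
        else:
            connections.add(conn)
        # Treat a missing option as True, matching the documented default.
        locking = parser.getboolean("scheduler", "use_row_level_locking", fallback=True)
        if not locking:
            problems.append(f"config {index}: use_row_level_locking is disabled")
    if len(connections) > 1:
        problems.append("schedulers point to different databases")
    return problems


# Two example configs (hypothetical connection string).
cfg_a = "[core]\nsql_alchemy_conn = postgresql://airflow@db/airflow\n"
cfg_b = cfg_a + "[scheduler]\nuse_row_level_locking = False\n"
print(check_scheduler_configs(cfg_a, cfg_b))
```

An empty list means the checked settings agree. Settings supplied through environment variables are not seen by this sketch.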
After starting two schedulers on different VMs, I ran a task to check whether it would be executed twice. Fortunately, only one scheduler picked it up.
If you run only one webserver, you will lose the logs of tasks that are executed on the other scheduler machine. In that situation, you need a log collector such as Elasticsearch.

Related

Airflow DAGs are queued up

I am working on a project where all of the DAGs are queued up and not moving (for roughly 24 hours or more).
It looks like the scheduler is broken, but I need to confirm that.
So here are my questions:
How can I see whether the scheduler is broken?
How do I reset my Airflow (webserver) scheduler?
I would appreciate some help on how to reset Airflow schedulers.
The answer will depend a lot on how you are running Airflow (standalone, in Docker, Astro CLI, managed solution...?).
If your scheduler is broken, the Airflow UI will usually show a warning with the time since the last heartbeat.
There is also an API endpoint for a scheduler health check at http://localhost:8080/health (if Airflow is running locally).
Check the scheduler logs. By default they are under $AIRFLOW_HOME/logs/scheduler.
You might also want to look at how to do health checks in Airflow in general.
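The health endpoint mentioned above returns JSON, so the check can be scripted. This is a minimal sketch, assuming the payload has a "scheduler" object with "status" and "latest_scheduler_heartbeat" keys and that the webserver listens on localhost:8080; the function names are illustrative only.

```python
import json
from urllib.request import urlopen


def scheduler_status(payload: dict) -> tuple[bool, str]:
    """Interpret the JSON body returned by the webserver's /health endpoint."""
    scheduler = payload.get("scheduler", {})
    healthy = scheduler.get("status") == "healthy"
    heartbeat = scheduler.get("latest_scheduler_heartbeat") or "unknown"
    return healthy, heartbeat


def fetch_health(url: str = "http://localhost:8080/health") -> dict:
    """Fetch the health payload; the default URL assumes a local webserver."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)
```

Calling scheduler_status(fetch_health()) against a running webserver returns a (healthy, last_heartbeat) pair. An unhealthy status or an old heartbeat suggests the scheduler process has stopped or is stuck.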
In terms of resetting, it is usually best to restart the scheduler, and again this depends on how you started it in the first place. If you are using a standalone instance and have the processes in the foreground, simply press Ctrl+C or close the terminal to stop it. If you are running Airflow in Docker, restart the container; for the Astro CLI there is astro dev restart.

The scheduler seems to be running under uWSGI, but threads have been disabled. You must run uWSGI with the --enable-threads option for the scheduler to work

I'm deploying a Django app to PythonAnywhere, where I use APScheduler to automatically send an expiry email whenever a subscription's end date has passed.
I don't know how to enable threads so that my web app runs properly on PythonAnywhere.
On hosting platforms like PythonAnywhere, there might be multiple copies of your site running at different times, in order to serve the traffic that you get. So you should not use an in-process scheduler to perform periodic tasks; instead, you should use the platform's built-in scheduled tasks function.
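To illustrate the scheduled-task approach, here is a minimal sketch of the job written as a plain script function that a platform scheduled task could run once per invocation, instead of a scheduler living inside the web process. The Subscription class, the notified flag, and the send_mail callable are hypothetical stand-ins; in a real Django project this logic would more likely query the model and live in a management command.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Subscription:
    # Hypothetical stand-in for the Django model in the question.
    email: str
    end_date: date
    notified: bool = False


def find_expired(subscriptions, today):
    """Return subscriptions whose end date has passed and that were not mailed yet."""
    return [s for s in subscriptions if s.end_date < today and not s.notified]


def run_once(subscriptions, send_mail, today=None):
    """One pass of the job: mail each expired subscription exactly once."""
    today = today or date.today()
    expired = find_expired(subscriptions, today)
    for sub in expired:
        send_mail(sub.email)
        sub.notified = True  # Mark it so the next scheduled run skips it.
    return len(expired)
```

Because each run is a separate short-lived process, it does not matter how many copies of the web app are running, and no threads need to be enabled in uWSGI.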

How does Heroku determine the number of web processes to run per dyno?

I'm using Heroku to host a Django application, and I'm using Waitress as my web server.
I run 2 (2X) dynos, and I see in the New Relic Instances tab that I have 10 instances running.
I was wondering how Heroku determines the number of web server processes to run on one dyno when using Waitress.
I know that when using Gunicorn there is a way to set the number of processes per dyno, but I didn't see any way to define it in Waitress.
Thanks!
In Waitress, there is a master process and (by default) 4 worker threads. You can change this if you wish. Here are the docs for these options for waitress-serve:
http://waitress.readthedocs.org/en/latest/runner.html#runner
--threads=INT
Number of threads used to process application logic, default is 4.
So if you have 2 dynos, and 5 (4+1) threads on each, then the total would come to 10 instances for this app in the RPM dashboard.
One can add more processes to the dynos as the maximum supported on Heroku 2x dynos is much higher:
2X dynos support no more than 512
https://devcenter.heroku.com/articles/dynos#process-thread-limits
But, you may want to check out some discussion on tuning this vs Gunicorn:
Waitress differs, in that it has an async master process that buffers the
entire client body before passing it onto a sync worker. Thus, the server
is resilient to slow clients, but is also guaranteed to process a maximum
of (default) 4 requests at once. This saves the database from overload, and
makes scaling services much more predictable.
Because waitress has no external dependencies, it also keeps the heroku
slug size smaller.
https://discussion.heroku.com/t/waitress-vs-gunicorn-for-docs/33
So after talking to New Relic support, they clarified the issue.
Apparently only processes are counted in the Instances tab (threads do not count).
In my Procfile I am also monitoring RabbitMQ workers, which add instances to the Instances tab, hence the mismatch.
To quote their answer:
I clarified with our developers how exactly we measure instances for the Instances tab. The Python agent views each monitored process as one instance. Threads do not count as additional instances. I noticed that you're monitoring not only your django/waitress app, but also some background tasks. It looks like the background tasks plus the django processes are adding up to that total of 10 processes being monitored.

Not sure if I should use celery

I have never used celery before and I'm also a django newbie so I'm not sure if I should use celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction to the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about doing a web GUI for this API with django.
So, my concern is that, if I use celery, I would have a queue in the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.) but the callback support for state changes is not ready. Here is where I think celery might be appropriate. I would have one or several periodic task(s) monitoring the job states.
Any advice on how to proceed please? No celery at all? celery for everything? celery just for the job states?
I use celery for similar purpose and it works well. Basically I have one node running celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, process the results for reporting or generating dependent tasks.
Each cluster node is running a very small Python server which takes the db id of its assigned job. It then calls into the main (HTTP) server to request the data it needs and finally posts the data back when complete. In my case, the individual nodes don't need to message each other and the run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a celery worker on each node taking tasks directly from the message queue. That approach is appealing. However, I have complex dependencies that are easier to work out from a centralized control. Also, I sometimes need to segment the cluster and centralized control makes this possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for executing tasks which are too expensive to run in the handler of an HTTP request (i.e. a Django view). Consider making an HTTP request from a Django view to some remote web server and think about latencies, possible timeouts, time for data transfer, etc. It also makes sense to queue computation-intensive, long-running tasks for background execution with Celery.
We can only guess what the web GUI for the API should do. However, Celery fits very well for queuing requests to scientific computation clusters. It also lets you track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, Celery broker (implementing queues for tasks) and worker processes (consuming queues and executing Celery tasks) all on the same server.
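The periodic state monitoring described in the question could look roughly like the sketch below. It is library-free on purpose: poll_job_states and get_state are hypothetical names, the state strings are made up, and nothing here is SAGA's or Celery's actual API. With Celery, a function like this would typically be the body of a task that the beat scheduler runs at a fixed interval.

```python
def poll_job_states(known_states, get_state):
    """Refresh the state of every unfinished job and return the changes.

    known_states maps job id -> last seen state and is updated in place.
    get_state is a hypothetical callable that asks the remote cluster for
    the current state of one job.
    """
    finished = {"finished", "failed"}
    changes = {}
    for job_id, old_state in known_states.items():
        if old_state in finished:
            continue  # Nothing left to monitor for this job.
        new_state = get_state(job_id)
        if new_state != old_state:
            changes[job_id] = (old_state, new_state)
    # Apply the updates after the loop, once all jobs have been checked.
    for job_id, (_, new_state) in changes.items():
        known_states[job_id] = new_state
    return changes
```

The returned dict of (old, new) pairs plays the role of the missing state-change callbacks: the caller can react only to jobs whose state actually changed since the last poll.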

Django Celery in production

I have everything I want to do with django-celery working on my development machine locally. I run Django, djcelery, celery and the broker (Amazon SQS). It sends tasks and just works.
I can set this all up like I have done locally (i.e. all on one machine), but what happens when I want to distribute tasks to another machine, share tasks, etc.? Is this a copy of the current machine (with Django, djcelery and celery), all connecting to the same SQS? How does this work? If they all connect to the same broker, do they just 'know'? Or does it not work like this?
Is it ok to start off with everything on one machine like I did in development (I will daemonize celery in production)?
Amazon SQS is the Simple Queue Service: jobs go in, wait to be run, and are removed from the queue once complete. Celery simply reads from this queue.
Celery can scale both horizontally and vertically. Do you need Celery to process more jobs faster? Give your machine more resources and up the worker count, that's vertical scaling; or boot more, smaller machines, which is horizontal scaling. Either way, your Celery workers are all consuming the same queue on SQS. How the rest of your infrastructure is affected depends on what your Celery jobs are doing. If they are writing to a DB, the more workers you have, the higher the load on your DB, so you would need to look at scaling that too.
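The "all workers consume the same queue" idea can be pictured with a small stdlib-only analogy. This is not Celery or SQS code: run_workers is a made-up helper, threads stand in for worker machines, and an in-memory queue stands in for the broker. Real brokers may also have different delivery guarantees than this toy queue.

```python
import queue
import threading


def run_workers(jobs, worker_count):
    """Process every job once using several workers on one shared queue."""
    shared = queue.Queue()
    for job in jobs:
        shared.put(job)

    handled = []  # (worker name, job) pairs
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = shared.get_nowait()
            except queue.Empty:
                return  # Queue drained; this worker stops.
            with lock:
                handled.append((name, job))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Adding workers changes who handles each job, not which jobs get handled. The workers do not need to know about each other, only about the shared queue.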
It is OK to start off with the all-on-one-machine approach. As the demand on your app grows, you can start moving Celery workers off to more machines or give your all-in-one server more resources.
Does this help? :)