Airflow DAGs are queued up - amazon-web-services

I am working on a project where I can see that all of the DAGs are queued up and not moving (for approximately 24 hours or more).
It looks like the scheduler is broken, but I need to confirm that.
So here are my questions:
How do I see whether the scheduler is broken?
How do I reset my Airflow (web server) scheduler?
I'm expecting some help regarding how to reset Airflow schedulers.

The answer will depend a lot on how you are running Airflow (standalone, in Docker, Astro CLI, managed solution...?).
If your scheduler is broken, the Airflow UI will usually show a warning with the time since the last scheduler heartbeat.
There is also an API endpoint for a scheduler health check at http://localhost:8080/health (if Airflow is running locally).
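As a minimal sketch (assuming a local Airflow 2.x webserver on port 8080, which you would adjust to your own deployment), you can poll that endpoint and read the scheduler status from the JSON it returns:

    # Minimal sketch: poll the /health endpoint and report the scheduler status.
    # The URL assumes a local Airflow webserver on port 8080; adjust as needed.
    import requests

    resp = requests.get("http://localhost:8080/health", timeout=10)
    resp.raise_for_status()
    health = resp.json()

    scheduler = health.get("scheduler", {})
    print("scheduler status:", scheduler.get("status"))
    print("last heartbeat:  ", scheduler.get("latest_scheduler_heartbeat"))

    if scheduler.get("status") != "healthy":
        print("The scheduler looks broken - check its logs and restart it.")

If the status is not healthy, the latest_scheduler_heartbeat timestamp gives you a rough idea of when the scheduler stopped.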
Check the scheduler logs. By default they are written to files under $AIRFLOW_HOME/logs/scheduler.
You might also want to look at how to do health checks in Airflow in general.
In terms of resetting, it is usually best to restart the scheduler, and again this depends on how you started it in the first place. If you are running a standalone instance with the processes in the foreground, simply press Ctrl+C or close the terminal to stop it. If you are running Airflow in Docker, restart the container; for the Astro CLI there is astro dev restart.

Related

How to get notified when a droplet reboots and when it finishes booting?

I found this answer https://stackoverflow.com/a/35456310/80353 and it recommends either the API or using user_data, which is actually cloud-init underneath.
I can think of several ways to possibly get notified that a server is up:
detect droplet status via the API
I notice that the status never changes during a reboot, so I guess this one is out.
using the DigitalOcean native monitoring agent
The monitoring agent seems to only cover resource utilisation; there is no alert when the server is being rebooted or finishes booting up.
using cloud-init
The answer https://stackoverflow.com/a/35456310/80353 I mentioned earlier uses wget to send signals out. I could possibly use wget every time the droplet finishes booting up via bootcmd in cloud-init, but not for reboots.
There's also the issue of how to ensure that the wget request from the right DigitalOcean droplet can correctly identify itself to my server.
Any advice on how to get notifications on my server whenever a droplet reboots or finishes booting up?
cloud-init
bootcmd actually runs every time; check out the module frequency key in the docs.
Another module you might consider for this is phone_home.
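As a rough sketch of the receiving side (standard library only; the port and the secret path segment are hypothetical, and since phone_home lets you choose the URL, you can embed a shared secret in it to identify your droplets):

    # Rough sketch of a receiver for cloud-init's phone_home POST.
    # The port (8000) and the secret path segment are hypothetical placeholders.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs

    SECRET_PATH = "/phone-home/change-me"  # must match the url in your cloud-init config

    class PhoneHomeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != SECRET_PATH:
                self.send_response(403)
                self.end_headers()
                return
            length = int(self.headers.get("Content-Length", 0))
            fields = parse_qs(self.rfile.read(length).decode())
            # phone_home posts form-encoded fields such as instance_id and hostname
            instance_id = fields.get("instance_id", ["unknown"])[0]
            hostname = fields.get("hostname", ["unknown"])[0]
            print(f"droplet {hostname} ({instance_id}) finished booting")
            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), PhoneHomeHandler).serve_forever()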
Systemd
Since the OP is looking for notifications on shutdown/reboot as well, cloud-init is probably not the best single solution, since it primarily handles boot/init. Assuming systemd:
This post discusses setting up a service to run on shutdown.
This post discusses setting up a service to run on startup.

Running multiple schedulers in Airflow 2.0 on same machine

I understand that Airflow 2.0 now supports running multiple schedulers concurrently for HA. Can I run multiple schedulers on the same machine (VM)? Do I do it just by running airflow scheduler twice if I want 2 schedulers, without configuring anything else?
Can I run multiple schedulers on the same machine (VM)?
Yes, you can. The second scheduler can be started like this:
airflow scheduler --skip-serve-logs
Can I run multiple schedulers on different machines (VMs)?
Yes, you can. I did create a second VM to run multiple schedulers.
Check these dependencies:
Set the use_row_level_locking config to True (the default is True).
Check your backend database's version (running multiple schedulers relies on the database supporting row-level locking, e.g. a recent PostgreSQL or MySQL 8+).
Make sure every scheduler's database config points to the same database.
After checking these dependencies, you can run multiple schedulers on different machines; see the sketch below.
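As a rough illustration (not an official check), you can print the relevant settings from a Python shell on each scheduler machine and compare them; the section names assume Airflow 2.x, and sql_alchemy_conn may live under [core] or [database] depending on the version:

    # Rough sketch: print the settings that matter for running multiple schedulers.
    # Run this on every machine that will host a scheduler and compare the output.
    from airflow.configuration import conf

    print("use_row_level_locking:", conf.getboolean("scheduler", "use_row_level_locking"))

    # The connection string moved from [core] to [database] in newer Airflow versions.
    try:
        conn = conf.get("database", "sql_alchemy_conn")
    except Exception:
        conn = conf.get("core", "sql_alchemy_conn")
    print("sql_alchemy_conn:", conn)  # must point at the same database everywhere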
After I started two schedulers on different VMs, I ran a task to check whether it would be executed twice. Fortunately, only one scheduler picked up the task.
If you run only one webserver, you will lose some task logs when tasks are executed on the other scheduler's machine. In that situation you need a log collector such as Elasticsearch.

Apache Ambari manual service start

I have an HDFS HA (NameNode high availability) setup in my Hadoop cluster (using Apache Ambari).
Now I have a scenario in which my ambari-server machine (which also hosts one NameNode, the active/primary one) went offline, so my other NameNode (standby) was active and running, but after some time it went offline too for some reason. The services were offline, I mean; I was unable to do any operation. What if I have to start manually the services that I used to start using Ambari?
I mean using the command line or something.
Services can be started from the command line, but typically they should not be in an Ambari environment. This is because Ambari does more than just start the service when you issue the start/restart command for any given service; it also makes sure the most up-to-date configuration is written to each node, along with various other housekeeping tasks.
You can look at the logs in Ambari when you start/restart a service to see exactly what Ambari does with respect to writing the configuration, the other housekeeping, and the exact command used to start/restart the given service.
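If the Ambari server itself is reachable and you just want to script the start, the Ambari REST API is the supported route, so Ambari still does its housekeeping. A rough sketch (the host, cluster name, service name and credentials are placeholders):

    # Rough sketch: ask Ambari to start a service (HDFS here) via its REST API,
    # so Ambari still handles config distribution and its other housekeeping.
    # The host, cluster name, service name and credentials are placeholders.
    import json
    import requests

    AMBARI = "http://ambari-host:8080"
    CLUSTER = "my_cluster"
    SERVICE = "HDFS"

    resp = requests.put(
        f"{AMBARI}/api/v1/clusters/{CLUSTER}/services/{SERVICE}",
        auth=("admin", "admin"),
        headers={"X-Requested-By": "ambari"},
        data=json.dumps({
            "RequestInfo": {"context": "Start service via REST"},
            "Body": {"ServiceInfo": {"state": "STARTED"}},
        }),
    )
    print(resp.status_code, resp.text)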

Django+celery: how to show worker logs on the web server

I'm using Django+celery for my 1st ever web development project, and rabbitmq as the broker. My celery workers are running on a different system from the web server and are executing long-running tasks. During the task execution, the task output will be dumped to local log files on the workers. I'd like to display these task log files through the web server so the user can know in real-time where the execution is, but I've no idea how I should transfer these log files between the workers and the system where the web server is. Any suggestion is appreciated.
Do not move the logs; just log to the same place. It can be really any database (relational or non-relational) accessible from both the web server and the Celery workers. You can even create (or look for) an appropriate Python logging handler that saves log records to the centralized storage.
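For example, here is a minimal sketch of such a handler, assuming a hypothetical Django model called TaskLog (with task_id, level and message fields) that both the workers and the web server can reach through the shared database:

    # Minimal sketch of a logging handler that writes Celery task log records to
    # a database table the web server can also read.
    # myapp.models.TaskLog and its fields are hypothetical; adapt to your schema.
    import logging

    class DatabaseLogHandler(logging.Handler):
        def emit(self, record):
            from myapp.models import TaskLog  # hypothetical model; imported lazily
            try:
                TaskLog.objects.create(
                    task_id=getattr(record, "task_id", ""),
                    level=record.levelname,
                    message=self.format(record),
                )
            except Exception:
                self.handleError(record)

    # On each worker, attach the handler to the Celery task logger:
    from celery.utils.log import get_task_logger

    logger = get_task_logger(__name__)
    logger.addHandler(DatabaseLogHandler())

The web server then simply queries the same table to show the latest log lines for a task.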
Maybe the solution isn't to move the logs, but to aggregate them. Take a look at some logging tools like Splunk, Loggly or Logscape.

Django Celery in production

I have everything I want to do with django-celery working locally on my development machine. I run Django, djcelery, celery and the broker (Amazon SQS). It sends tasks and it just works.
I can set this all up like I have done locally (i.e. all on one machine), but what happens when I want to distribute tasks to another machine / share tasks etc.? Is this a copy of the current machine (with Django, djcelery and celery), all connecting to the same SQS? How does this work? If they all connect to the same broker, do they just 'know'? Or does it not work like this?
Is it ok to start off with everything on one machine like I did in development (I will daemonize celery in production)?
Amazon SQS is a Simple Queueing Service: jobs go in, wait to be run, and are removed from the queue once complete. Celery simply reads off of this queue.
Celery can scale both horizontally and vertically. Do you need Celery to process more jobs faster? Give your machine more resources and increase the worker count (that's vertical scaling), or boot more, smaller machines (that's horizontal scaling). Either way, your Celery workers are all consuming the same queue on SQS. How the rest of your infrastructure is affected depends on what your Celery jobs are doing: if they write to a DB, the more workers you have, the higher the load on your DB, so you would need to look at scaling that too.
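For instance, a minimal sketch of a Celery app pointed at SQS that every machine would share (the app name, queue name and region below are placeholders, and AWS credentials are assumed to come from the environment or an instance role):

    # Minimal sketch: a Celery app that every machine shares, all consuming the
    # same SQS queue. Module name, queue name and region are placeholders.
    from celery import Celery

    app = Celery("myproject", broker="sqs://")

    app.conf.broker_transport_options = {
        "region": "us-east-1",
        "visibility_timeout": 3600,  # seconds a message stays hidden after being picked up
    }
    app.conf.task_default_queue = "myproject-tasks"

    @app.task
    def add(x, y):
        return x + y

Each machine just runs another worker against this same app (celery -A myproject worker), so they all pull from the same SQS queue and any given job is handed to only one worker.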
It is OK to start off with the all-on-one-machine approach. As the demand on your app grows, you can start moving Celery workers off to more machines or give your all-in-one server more resources.
Does this help? :)