Apache Airflow scheduler is not running after some time - airflow-scheduler

I am running a complex flow in Apache Airflow with the LocalExecutor and a Postgres DB. It runs tasks for a while, then the scheduler goes down after some time, and I can't see any logs in the Airflow console.
I am using Airflow - puckel/docker-airflow:1.10.9 deployed in an OpenShift environment.
Error in the Airflow UI:
The scheduler does not appear to be running. Last heartbeat was received 3 hours ago.
The DAGs list may not update, and new tasks will not be scheduled.

Related

Is there a way to verify if Celery is up in Google Cloud as all the jobs are going into queued status

Is there a way to verify in GCP whether Celery is up or down? All our Airflow jobs are going into queued status and not getting executed. (Airflow is running in Google Cloud.)
Airflow has Flower built in, which can be used to monitor Celery:
Flower is a web based tool for monitoring and administrating Celery clusters.
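If Flower is not exposed in your deployment, a quick liveness check from Python is Celery's control ping. A minimal sketch, assuming it points at the same broker your CeleryExecutor uses (the broker URL is a placeholder):

```python
# Minimal worker liveness check via Celery's control API.
from celery import Celery

app = Celery(broker="redis://my-redis:6379/0")  # assumption: your broker URL

replies = app.control.ping(timeout=5.0)  # [] if no worker answers in time
if not replies:
    print("No Celery workers responded - they appear to be down")
else:
    for reply in replies:
        print("Alive worker:", list(reply.keys())[0])
```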

How does Cloud Run behave with things that are running in my application during the deploy of a new service revision?

I'm migrating a PHP web application that currently runs on Compute Engine to Cloud Run. Currently, this platform schedules the execution of some PHP scripts in the form of cron jobs.
Let's say that I plan to use Cloud Scheduler to schedule requests to some of these PHP scripts after migrating to Cloud Run. My question is about how Cloud Run will behave if any of these PHP scripts happen to be running when a new service revision is deployed: would the deploy of a new revision kill a script execution (triggered by a Cloud Scheduler request) that is in progress?
Also, I would like to know how Cloud Run behaves with (any) requests in progress during a new service revision deploy. Maybe both of my questions are related/connected.
(Maybe I am wrong when I think that the deploy of a new revision will immediately kill everything running and every request in progress to the service.)
When you deploy a new revision, new requests are routed to the new revision. Requests that are already in progress continue on the existing instances of the previous revision. When an instance of the old revision no longer has any active requests, it is deleted after a while (about 15 minutes today).
So, the two questions are related. But one remark: if you run a PHP script with Cloud Scheduler, the HTTP request that you perform must stay active until the end of the script. If your PHP script sends its response before the processing is finished, first the CPU will be throttled and your script will be very, very slow, and second, Cloud Run will consider the instance inactive (not serving an active request) and may delete it whenever it wants.
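To illustrate that remark (a Python/Flask stand-in for the PHP script, not from the original answer): the handler that Cloud Scheduler calls should only return its response once the work is done, so the request stays active for the whole run.

```python
# Illustrative sketch only: do all the work before returning, so Cloud Run
# keeps the request (and the CPU allocation) alive until the job finishes.
from flask import Flask

app = Flask(__name__)


def do_the_long_work():
    # placeholder for the real batch logic
    pass


@app.route("/run-job", methods=["POST"])
def run_job():
    do_the_long_work()
    return "done", 200   # respond only after the work is complete


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```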

How to set up a long running Django command in Google Cloud Platform

I have recently moved my site to Google Cloud Run.
The problem is I also need to move a couple of cron jobs that run a Django command every day inside a container. What is the preferred way of doing this if I don't want to pay for a full Kubernetes cluster with always running node instances?
I would like the task to run and then spin the server down, just as Cloud Run does when I get an incoming request. I have searched through all the documentation, but I am having trouble finding the correct solution for long-running tasks inside containers that do not require an underlying server in Google Cloud.
Can someone point me in the right direction?
Cloud Run request timeout limit is 15 minutes.
Cloud Functions function timeout limit is 540 seconds.
For long-running tasks, spinning a Compute Engine instance up and down when needed would be the preferred option.
An example of how to schedule, run and stop Compute Instances automatically is nicely explained here:
Scheduling compute instances with Cloud Scheduler
In brief: the actual instance start/stop is performed by Cloud Functions. Cloud Scheduler publishes the required tasks to a Cloud Pub/Sub queue on a schedule, which triggers these functions. At the end of its main logic, your code can also publish a message to Cloud Pub/Sub to trigger the "Stop this instance" task.
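As a rough sketch of that last step (the project, topic and instance names below are placeholders, and the exact payload depends on how your Cloud Function parses the message):

```python
# Hedged sketch: at the end of the main logic, publish a message to the
# Pub/Sub topic that the "stop instance" Cloud Function listens on.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "stop-instance")  # placeholders

future = publisher.publish(
    topic_path,
    b"stop",                          # message body
    instance="my-worker-instance",    # attributes for the Cloud Function to read
    zone="europe-west1-b",
)
future.result()  # block until Pub/Sub has accepted the message
```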
How to process the task in Django?
it can be the same Django app started with a WSGI server to process incoming requests (like a regular Django site) but with increased request/response and other timeouts and a long WSGI worker lifetime ... - in this case the task is a regular HTTP request to a Django view
it can be just one script (or Django management command) run at cloud instance startup to automatically execute one task
you may also want to pass additional arguments for the task; in this case you can publish to Cloud Pub/Sub one "Start instance" task and one main-logic task with custom arguments, and make your code pull from Pub/Sub first
more Django-native - use Celery and start a Celery worker as a separate Compute Engine instance
One possible option for using just one Celery worker without all the other parts (i.e. a broker - there is no official built-in Cloud Pub/Sub support) and pulling/pushing tasks from/to Cloud Pub/Sub, sketched below:
run the Celery worker with a dummy filesystem broker
add the target method as a periodic_task to run e.g. every 30 seconds
at the start of the task - subscribe to the Cloud Pub/Sub queue, check for a new task, receive one and start processing
at the end of the task - publish the results to Cloud Pub/Sub, plus a call to "Stop this instance"
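A minimal sketch of that pattern, assuming celery and google-cloud-pubsub are installed; the project ID, subscription name and do_main_logic() helper are placeholders, not part of the original answer:

```python
# Hedged sketch of the "dummy filesystem broker" pattern described above.
# Run with:  celery -A tasks worker --beat --loglevel=info
from celery import Celery
from google.cloud import pubsub_v1

app = Celery("tasks", broker="filesystem://")
app.conf.broker_transport_options = {
    "data_folder_in": "/tmp/celery-queue",   # folders must exist beforehand
    "data_folder_out": "/tmp/celery-queue",
}
# Celery beat triggers the poll every 30 seconds.
app.conf.beat_schedule = {
    "poll-pubsub": {"task": "tasks.poll_pubsub", "schedule": 30.0},
}


def do_main_logic(payload: bytes) -> None:
    # placeholder for the real long-running work
    print("processing", payload)


@app.task(name="tasks.poll_pubsub")
def poll_pubsub():
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "my-subscription")
    response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
    for received in response.received_messages:
        do_main_logic(received.message.data)
        subscriber.acknowledge(
            request={"subscription": sub_path, "ack_ids": [received.ack_id]}
        )
    # here you could also publish the results and a "Stop this instance" message
```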
There is also Cloud Tasks (timeout limit: with auto-startup - 10 minutes, manual startup - 24 hours) as a Cloud Run addition for asynchronous tasks, but in this case Cloud Pub/Sub is more suitable.

Airflow cluster: Is it needed to deploy DAGs / Workflows in all the workers?

We are planning to update Airflow and switch from single Airflow server to Airflow cluster (AWS).
We've been checking this article and this one.
We are using SQS as the queue service and, although the documentation says that we only need to deploy our DAG .py files on the masters, we wonder if this is correct.
The communication through the queues doesn't include the code.
In our tests, our DAGs do not work unless we deploy them on all nodes, workers and masters.
So, what should we do?
Many thanks!
Your DAGs need to be synced across all workers for this to work, because the Airflow scheduler will send tasks from the DAG to whichever worker is available. If the DAGs are not synced across all workers, an older copy of the DAG may be run.
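Purely as an illustration of one way to keep them in sync (the hostnames and paths below are placeholders; a shared volume, S3/GCS sync, or a git-sync sidecar are common alternatives):

```python
# Hedged sketch: push the master's dags/ folder to every worker on a schedule
# (cron, systemd timer, or a CI step). Hostnames and paths are placeholders.
import subprocess

WORKERS = ["worker-1", "worker-2"]          # placeholder hostnames
DAGS_DIR = "/usr/local/airflow/dags/"       # placeholder path

for host in WORKERS:
    subprocess.run(
        ["rsync", "-az", "--delete", DAGS_DIR, f"{host}:{DAGS_DIR}"],
        check=True,
    )
```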

AWS ECS upgrade to new task definitions kills long running outgoing connections

We are using Celery for asynchronous tasks that keep a connection open to a remote server. These Celery jobs can run for up to 10 minutes.
When we deploy a new version of our code, AWS ECS won't wait for these jobs to finish, so it kills the instances running the Celery workers before they are done.
One solution is to tell Celery to retry a task if it fails, but that could potentially cause other problems.
Is there a way to avoid this? Can we instruct AWS ECS to wait for completion of outgoing connections? Any other way to approach this?