Apache Airflow - how many tasks in a DAG is too many? - airflow-scheduler

I tried having a DAG with 400 tasks (each one calling a remote Spark server to process a separate data file into S3... nothing to do with MySQL) and Airflow (v1.10.3) did the following for the next 15 minutes:
CPU stayed at 99%
did not handle new PuTTY login or SSH requests to my machine (Amazon Linux)
the Airflow webserver stopped responding... only gave 504 errors
started 130 concurrent connections to MySQL RDS (the Airflow metadata DB)
kept my tasks stuck in the scheduled state
I eventually switched to another EC2 instance but got the same outcome...
I am running LocalExecutor on a single machine (16 CPUs).
Note that a DAG with 30 tasks runs fine.

There's no actual limit to the number of tasks in a DAG. In your case, you're using LocalExecutor - Airflow will then use any resources available on the host to execute the tasks. It sounds like you just overwhelmed your EC2 instance's resources and overloaded the Airflow worker(s) / scheduler. I'd recommend adding more workers to break up the tasks or lowering the parallelism value in your airflow.cfg.
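As an illustration of that second suggestion, a similar throttle can also be applied per DAG via the concurrency argument (in Airflow 1.10.x), without touching the cluster-wide parallelism / dag_concurrency settings in airflow.cfg. A minimal sketch, with placeholder values for a 16-CPU LocalExecutor host:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical example: cap how many of the 400 tasks run at once so a
# 16-CPU LocalExecutor host is not overwhelmed. Numbers are placeholders.
dag = DAG(
    dag_id="spark_files_to_s3",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    concurrency=16,       # max tasks of this DAG running at the same time
    max_active_runs=1,    # only one DAG run at a time
)

for i in range(400):
    BashOperator(
        task_id=f"process_file_{i}",
        bash_command=f"echo 'submit file {i} to the remote Spark server'",
        dag=dag,
    )
```

With a cap like this the remaining tasks simply wait in the scheduled state instead of all competing for the host at once.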

Related

Django + Gunicorn on Google Cloud Run, how are different parameters of Gunicorn and GCR related?

For deploying a Django web app to GCR, I would like to understand the relationships between various autoscaling related parameters of Gunicorn and GCR.
Gunicorn has flags like:
workers
threads
timeout
Google Cloud Run has these configuration options:
CPU limit
Min instances
Max instances
Concurrency
My understanding so far:
The number of workers set in Gunicorn should match the CPU limit of GCR.
We set timeout to 0 in Gunicorn to allow GCP to autoscale the GCR instance.
GCP will always keep some instances alive; this number is Min instances.
When more traffic comes, GCP will autoscale up to a certain number; this number is Max instances.
I want to know the role of threads (Gunicorn) and concurrency (GCR) in autoscaling. More specifically:
How does the number of threads in Gunicorn affect autoscaling?
I think this should not affect autoscaling at all. Threads are useful for background tasks such as file operations, making async calls, etc.
How does the Concurrency setting of GCR affect autoscaling?
If the number of workers is set to 1, then a particular instance should be able to handle only one request at a time, so setting this value to anything more than 1 does not help. In fact, we should set the CPU limit, concurrency, and workers to match each other. Please let me know if this is correct.
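For reference, the three Gunicorn settings discussed above would typically live in a gunicorn.conf.py. A minimal sketch with illustrative (assumed) values, not a recommendation:

```python
# gunicorn.conf.py -- illustrative values only, tune for your Cloud Run CPU limit.
import multiprocessing

# One worker per vCPU (assumption: the container sees the same CPU count
# as the Cloud Run CPU limit).
workers = multiprocessing.cpu_count()

# A few threads per worker so a worker can overlap I/O-bound work
# (DB reads/writes) while another request waits.
threads = 4

# 0 disables Gunicorn's own worker timeout and leaves request timeouts
# to Cloud Run, as described above.
timeout = 0
```

Under these assumptions one instance can serve roughly workers x threads requests at once, which is the ceiling the GCR Concurrency setting should not exceed.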
Edit 1:
Adding some details in response to John Hanley's comment.
We expect to have up to 100 req/s. This is based on what we've seen in the GCP console. If our business grows we'll get more traffic, so I would like to understand how the final decision changes if we expect, say, 200 or 500 req/s.
We expect requests to arrive in bursts. Users are groups of people who perform some activities on our web app during a given time window. There can be only one such event on a given day, but the event will see 1000 or more users using our services for a 30 minute window. On busy days, we can have multiple events, some of them may overlap. The service will be idle outside of the event times.
How many simultaneous requests can a Cloud Run instance handle? I am trying to understand this one myself. Without Cloud Run, I could have deployed this with x workers and then the answer would have been x. But with Cloud Run, I don't know if the number of Gunicorn workers has the same meaning.
Edit 2: more details.
The application is stateless.
The web app reads and writes to DB.

Airflow Scheduler - Ephemeral Storage - Evicted

I've been running into what should be a simple issue with my Airflow scheduler. Every couple of weeks, the scheduler becomes Evicted. When I run a describe on the pod, the issue is: "The node was low on resource: ephemeral-storage. Container scheduler was using 14386916Ki, which exceeds its request of 0."
The question is twofold. First, why is the scheduler utilizing ephemeral storage? And second, is it possible to add ephemeral storage when running on EKS?
Thanks!
I believe ephemeral storage is not an Airflow question but more a matter of how your K8s cluster is configured.
Assuming we are talking about OpenShift's ephemeral storage:
https://docs.openshift.com/container-platform/4.9/storage/understanding-ephemeral-storage.html
This can be configured in your cluster, and it will make "/var/log" ephemeral.
I think the problem is that /var/log gets full, possibly from some of the system logs (not from Airflow but from other processes running in the same container). I think a solution would be to have a job that cleans those system logs periodically.
For example, we have this script that cleans up Airflow logs:
https://github.com/apache/airflow/blob/main/scripts/in_container/prod/clean-logs.sh
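If reusing that script isn't an option, a minimal cleanup job in Python might look like the sketch below; the directory and retention window are assumptions to adjust for your pod:

```python
import os
import time

# Hypothetical cleanup job: delete log files older than N days under the
# directory that keeps filling the pod's ephemeral storage.
LOG_DIR = "/var/log"        # assumption: the directory that fills up
MAX_AGE_DAYS = 7            # assumption: retention window

cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600

for root, _dirs, files in os.walk(LOG_DIR):
    for name in files:
        path = os.path.join(root, name)
        try:
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
        except OSError:
            # File may have been rotated or removed by another process.
            pass
```

Something like this could run as a Kubernetes CronJob or a sidecar alongside the scheduler.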

Limit concurrency of AWS Ecs tasks

I have deployed a Selenium script on ECS Fargate which communicates with my server through an API. Normally almost 300 scripts run in parallel and bombard my server with API requests. I am facing a Net::Read::Timeout error because the server is unable to respond within the given time frame. How can I limit the number of ECS tasks running in parallel?
For example, if I run 300 scripts, 50 scripts should run in parallel and the remaining 250 scripts should be in a pending state.
I think for your use case, you should have a look at AWS Batch, which supports Docker jobs and job queues.
This question was about limiting concurrency on AWS Batch: AWS batch - how to limit number of concurrent jobs
Edit: by the way, the same strategy could maybe be applied to ECS, i.e. assigning your scripts to only a few instances so that more can't be provisioned until the previous ones have finished.
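If you do go the AWS Batch route, the usual lever for capping concurrency (as in the linked question) is the compute environment's maxvCpus: with 2-vCPU jobs, a cap of 100 vCPUs means at most 50 jobs run at once and the rest stay queued. A rough boto3 sketch, with placeholder names, subnets and security groups:

```python
import boto3

batch = boto3.client("batch")

# Illustrative only: cap the compute environment at 100 vCPUs so that,
# with 2-vCPU Fargate jobs, at most 50 jobs run concurrently; the rest
# remain RUNNABLE in the queue until capacity frees up.
batch.create_compute_environment(
    computeEnvironmentName="selenium-ce",          # placeholder name
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "FARGATE",
        "maxvCpus": 100,
        "subnets": ["subnet-aaaaaaaa"],            # placeholder subnet
        "securityGroupIds": ["sg-aaaaaaaa"],       # placeholder security group
    },
)

batch.create_job_queue(
    jobQueueName="selenium-queue",                 # placeholder name
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "selenium-ce"},
    ],
)
```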
I am unclear how your script works and there may be many ways to peel this onion, but one way that would be easier to implement, assuming your tasks/scripts are long running, is to create an ECS service and modify the number of tasks in it. You can start with a service that has 50 tasks and then update the service to 20 or 300 or any number you want. The service will deploy/remove tasks depending on the task count parameter you configured.
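A minimal sketch of that service-count approach with boto3; the cluster and service names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Illustrative only: a service created with, say, 50 tasks can later be
# scaled up or down by updating its desired count; ECS starts or stops
# tasks to converge on that number.
ecs.update_service(
    cluster="selenium-cluster",      # placeholder cluster name
    service="selenium-scripts",      # placeholder service name
    desiredCount=50,                 # how many tasks should run in parallel
)
```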
This of course assumes the tasks (and the script) run infinitely. If your script is such that it starts and it ends at some point (in a batch sort of way) then probably launching them with either AWS Batch or Step Functions would be a better approach.

Auto scaling service in AWS without duplicating cron jobs

I have a (Golang web server) service running on AWS on an EC2 instance (no auto scaling). This service has a few cron jobs that run throughout the day, and these jobs start when the service starts.
I would like to take advantage of auto scaling in some form on AWS. I have been looking at ECS and Beanstalk.
When I add auto scaling I need the cron job to execute on only one of the scaled services due to rate limits on external APIs. Right now the cron job is tightly coupled to the service, and I am looking for an option that does not require moving the cron job to its own service.
How can I achieve this in a good way using AWS?
You're going to get this problem as a general issue in any scalable application where crons cannot / should not run multiple times. It's not really AWS-specific. I'm not sure to what extent you want to keep things coupled or how your crons are currently run, but here are a few suggestions that might work for you:
Create a "cron runner" instance with a limit to run crons on
You could create a separate ECS service which has no autoscaling and a fixed value of 1 instance. This instance would run the same copy of your code as your "normal" instances and would run crons. You would turn crons off on your "normal" instances. You might find that this can be a very small instance since it doesn't handle any web traffic.
Create a "cron trigger" instance which fires off crons remotely
Here you create one "trigger" instance which sends a request to your normal instances through an ALB. Because the ALB will route the request to one of the servers behind it, the cron only gets run once. One thing to watch out for is that if your cron is long running, you may need to consider your request timeouts. You'll also have to think about retries etc., but I assume you already have a process that can be adapted for that.
The above solutions can be adapted with message queues etc but the basis of both is that there is another instance of some kind which starts the cron and is separate from your normal servers. Depending on when your cron runs, you may only need to run this cron instance for a few hours per day so it can be cost efficient to do things like this.
Personally I have used both methods in a multi-tenant application, and I had to go with the following option because of the number of tenants and the time / resources it took to run the crons for all of them at once:
A CloudWatch schedule triggers a Lambda which sends a message to SQS to queue a cron for each tenant individually.
Cron servers (totally separate from the main web servers but running the same / similar code) pull messages and run the cron for each tenant individually. They store a key in Redis for crons which must only run once, to avoid the double runs that "at least once" delivery can cause.
This also helps handle failures, with retry policies and dead-letter queues managed in SQS.
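For illustration, a rough sketch of that Redis "run once" check, assuming the redis-py client and a hypothetical message format and helper function:

```python
import json
import redis

r = redis.Redis()

def handle_message(body: str) -> None:
    """Process one SQS cron message, skipping duplicates via a Redis lock."""
    # Assumed message format: {"tenant": ..., "cron": ..., "run_id": ...}
    msg = json.loads(body)
    lock_key = f"cron-lock:{msg['cron']}:{msg['tenant']}:{msg['run_id']}"

    # SET with nx=True succeeds only for the first consumer of this run;
    # the TTL keeps a stale lock from blocking tomorrow's run.
    if not r.set(lock_key, "running", nx=True, ex=6 * 3600):
        return  # another worker already picked up this "at least once" duplicate

    run_cron_for_tenant(msg["tenant"], msg["cron"])   # hypothetical existing function
```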
Ultimately you need to kick off these crons from one place. If possible, change up your crons so it doesn't matter if they run twice. It makes it easier to deal with retries and things like that.

Apache Airflow - Run task on EC2

We're considering migrating our data pipelines to Airflow and one item we require is the ability for a task to create, execute on, and destroy an EC2 instance. I know that Airflow supports ECS and Fargate which will have a similar effect, but not all of our tasks will fit directly into that paradigm without significant refactoring.
I see that we can use a distributed executor and scale the pool of workers up and down manually, but we really don't need to have workers up all the time, only occasionally, and when we do we're just as well served by having a dedicated machine for each task as it runs, destroying each machine as the task completes.
The idea I have stuck in my head would be something like an "EphemeralEC2Operator", which would stand up a machine, SSH in, run a bash script which orchestrates the task, and then tear the machine down.
Does this capability exist, or would we have to implement it ourselves?
Thanks in advance.
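For reference, a bare-bones sketch of what such an operator might look like with boto3; the AMI, instance type and SSH helper are hypothetical placeholders, and error handling/retries are omitted:

```python
import boto3
from airflow.models import BaseOperator


class EphemeralEC2Operator(BaseOperator):
    """Hypothetical sketch: launch an instance, run a script on it, terminate it."""

    def __init__(self, script, ami_id, instance_type="m5.large", **kwargs):
        super().__init__(**kwargs)
        self.script = script
        self.ami_id = ami_id              # placeholder AMI with dependencies baked in
        self.instance_type = instance_type

    def execute(self, context):
        ec2 = boto3.client("ec2")
        instance_id = ec2.run_instances(
            ImageId=self.ami_id,
            InstanceType=self.instance_type,
            MinCount=1,
            MaxCount=1,
        )["Instances"][0]["InstanceId"]
        try:
            ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
            # Run the bash script on the box, e.g. via SSHHook/paramiko, or pass
            # it as EC2 user data instead of SSH; details intentionally omitted.
            self._run_over_ssh(instance_id, self.script)   # hypothetical helper
        finally:
            # Tear the machine down whether the script succeeded or not.
            ec2.terminate_instances(InstanceIds=[instance_id])
```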