We're considering migrating our data pipelines to Airflow and one item we require is the ability for a task to create, execute on, and destroy an EC2 instance. I know that Airflow supports ECS and Fargate which will have a similar effect, but not all of our tasks will fit directly into that paradigm without significant refactoring.
I see that we can use a distributed executor and scale the pool of workers up and down manually, but we really don't need to have workers up all the time, only occasionally, and when we do we're just as well served by having a dedicated machine for each task as it runs, destroying each machine as the task completes.
The idea I have stuck in my head would be something like an "EphemeralEC2Operator", which would stand up a machine, SSH in, run a bash script which orchestrates the task, and then tear the machine down.
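Roughly the shape I have in mind (just a sketch; the AMI, key pair, and SSH details are placeholders, and it assumes boto3 plus SSH access from the worker):

```python
# Sketch of a hypothetical "EphemeralEC2Operator" (not an existing Airflow operator).
# AMI, instance type, key pair, and SSH user are placeholders.
import subprocess

import boto3
from airflow.models import BaseOperator


class EphemeralEC2Operator(BaseOperator):
    def __init__(self, ami_id, instance_type, key_name, ssh_user, script, **kwargs):
        super().__init__(**kwargs)
        self.ami_id = ami_id
        self.instance_type = instance_type
        self.key_name = key_name
        self.ssh_user = ssh_user
        self.script = script

    def execute(self, context):
        ec2 = boto3.client("ec2")
        instance_id = ec2.run_instances(
            ImageId=self.ami_id, InstanceType=self.instance_type,
            KeyName=self.key_name, MinCount=1, MaxCount=1,
        )["Instances"][0]["InstanceId"]
        try:
            # Wait until the instance is reachable, then run the orchestration script over SSH.
            ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
            ip = ec2.describe_instances(InstanceIds=[instance_id])[
                "Reservations"][0]["Instances"][0]["PublicIpAddress"]
            subprocess.run(
                ["ssh", "-o", "StrictHostKeyChecking=no",
                 f"{self.ssh_user}@{ip}", self.script],
                check=True,
            )
        finally:
            # Tear the machine down whether the task succeeded or failed.
            ec2.terminate_instances(InstanceIds=[instance_id])
```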
Does this capability exist, or would we have to implement it ourselves?
Thanks in advance.
I have deployed a selenium script on ECS Fargate which communicates with my server through an API. Normally almost 300 scripts run in parallel and bombard my server with API requests. I am facing a Net::Read::Timeout error because the server is unable to respond within the given time frame. How can I limit the number of ECS tasks running in parallel?
For example, if I have run 300 scripts, 50 should run in parallel and the remaining 250 should be in a pending state.
I think for your use case, you should have a look at AWS Batch, which supports Docker jobs, and job queues.
This question was about limiting concurrency on AWS Batch: AWS batch - how to limit number of concurrent jobs
Edit: btw, the same strategy could maybe be applied to ECS, as in assigning your scripts to only a few instances, so that more can't be provisioned until the previous ones have finished.
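To illustrate the Batch approach (a sketch; the compute environment, queue, and job definition names are placeholders): the compute environment's maxvCpus is what caps concurrency, and everything beyond it simply waits in the queue.

```python
# Sketch: cap concurrency via the compute environment, then submit all 300 jobs.
# All resource names, subnets, and security groups below are placeholders, and the
# job queue / job definition are assumed to already exist.
import boto3

batch = boto3.client("batch")

# With 1 vCPU per job, maxvCpus=50 means at most ~50 jobs run in parallel.
batch.create_compute_environment(
    computeEnvironmentName="selenium-ce",
    type="MANAGED",
    computeResources={
        "type": "FARGATE",
        "maxvCpus": 50,
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# Submit all 300 scripts; Batch runs up to the cap and queues the rest.
for i in range(300):
    batch.submit_job(
        jobName=f"selenium-script-{i}",
        jobQueue="selenium-queue",        # queue attached to the compute environment above
        jobDefinition="selenium-script",  # job definition wrapping the script's container
    )
```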
I am unclear how your script works, and there may be many ways to peel this onion, but one way that would be easier to implement, assuming your tasks/scripts are long running, is to create an ECS service and modify the number of tasks in it. You can start with a service that has 50 tasks and then update the service to 20 or 300 or any number you want. The service will deploy or remove tasks depending on the task count parameter you configured.
This of course assumes the tasks (and the script) run indefinitely. If your script is such that it starts and ends at some point (in a batch sort of way), then launching them with either AWS Batch or Step Functions would probably be a better approach.
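If you do go the service route, changing the task count is a single API call (a sketch; the cluster and service names are placeholders):

```python
# Sketch: scale an existing ECS service by changing its desired task count.
# Cluster and service names are placeholders.
import boto3

ecs = boto3.client("ecs")

# ECS will start or stop tasks until the service matches desiredCount.
ecs.update_service(cluster="my-cluster", service="selenium-service", desiredCount=50)
```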
I have a (Golang web server) service running on AWS on an EC2 instance (no auto scaling). This service has a few cron jobs that run throughout the day, and these jobs start when the service starts.
I would like to take advantage of auto scaling in some form on AWS. I've been looking at ECS and Elastic Beanstalk.
When I add auto scaling, I need the cron jobs to execute on only one of the scaled instances due to rate limits on external APIs. Right now the cron jobs are tightly coupled to the service, and I am looking for an option that does not require moving them into their own service.
How can I achieve this in a good way using AWS?
You're going to hit this as a general issue in any scalable application where crons cannot / should not run multiple times; it's not really AWS-specific. I'm not sure to what extent you want to keep things coupled or how your crons are currently run, but here are a few suggestions that might work for you:
Create a "cron runner" instance with a limit to run crons on
You could create a separate ECS service which has no autoscaling and a fixed value of 1 instance. This instance would run the same copy of your code as your "normal" instances and would run crons. You would turn crons off on your "normal" instances. You might find that this can be a very small instance since it doesn't handle any web traffic.
Create a "cron trigger" instance which fires off crons remotely
Here you create one "trigger" instance which sends a request to your normal instances through an ALB. Because the ALB routes the request to just one of the servers behind it, the cron only gets run once. One thing to watch out for is that if your cron is long running, you may need to consider your request timeouts. You'll also have to think about retries etc., but I assume you already have a process that can be adapted for that.
The above solutions can be adapted with message queues etc but the basis of both is that there is another instance of some kind which starts the cron and is separate from your normal servers. Depending on when your cron runs, you may only need to run this cron instance for a few hours per day so it can be cost efficient to do things like this.
Personally I have used both methods in a multi-tenant application, and I had to go with the option of running the cron like this due to the number of tenants and the time/resources it took to run the crons for all of them at once:
A CloudWatch schedule triggers a Lambda, which sends a message to SQS to queue a cron for each tenant individually.
Cron servers (totally separate from the main web servers but running the same / similar code) pull messages and run the cron for each tenant individually. For crons that are vital to run only once, they store a key in Redis so that SQS's "at least once" delivery doesn't cause a cron to run twice (see the sketch below).
This also helps handle failures, with retry policies and dead-letter queues managed in SQS.
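The Redis guard on the cron servers can be very small; here is a sketch with redis-py (host, key format, and TTL are placeholders):

```python
# Sketch: guard a "run once" cron against SQS at-least-once delivery.
# Redis host, key format, and TTL are placeholders.
import redis

r = redis.Redis(host="my-redis-host", port=6379)

def try_claim(tenant_id: str, cron_name: str, ttl_seconds: int = 3600) -> bool:
    """Return True only for the first worker to claim this tenant/cron run."""
    key = f"cron-lock:{cron_name}:{tenant_id}"
    # SET ... NX EX: succeeds only if the key does not exist yet.
    return bool(r.set(key, "running", nx=True, ex=ttl_seconds))

# In the SQS consumer:
# if try_claim(message["tenant_id"], message["cron"]):
#     run_cron(message)   # run_cron is your tenant-specific cron logic (placeholder)
```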
Ultimately you need to kick off these crons from one place. If possible, change up your crons so it doesn't matter if they run twice. It makes it easier to deal with retries and things like that.
I am able to successfully deploy a Django Celery worker as a Docker container on AWS ECS using Fargate as the compute layer.
But my concern is that the Celery container runs 24/7. If I could run the container only when a task is assigned, I could save a lot of money given Fargate's billing model.
Celery isn't really the right thing to use because it's designed to persist, but the goal should be reasonably easy to achieve.
Architecturally, you probably want to run a script on a Fargate task. The script chews through the queue and then dies. You'd trigger that task somehow:
An API call from your data receiver (e.g. Django)
A lambda function (triggered by what?)
Still some open questions... do you limit yourself to one task at a time or do you need to manage concurrent requests to the queue? Do you retry? But a plausible place to start.
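A minimal trigger could look like this (a sketch; the cluster, task definition, and network settings are placeholders, and it assumes the task's container runs your queue-draining script and exits):

```python
# Sketch of a Lambda handler that launches a one-off Fargate task to drain the queue.
# Cluster, task definition, subnets, and network settings are placeholders.
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    ecs.run_task(
        cluster="worker-cluster",
        launchType="FARGATE",
        taskDefinition="queue-worker",   # container runs the script and then exits
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
    )
    return {"statusCode": 202}
```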
A not-recommended but perhaps easier way to do it would be to run a Celery worker in your Django container (e.g. using supervisor) and use Fargate's autoscaling features. You'd always have the one Django container running to receive data. If the Celery worker on that container used up all of the available resources, Fargate would scale the service by adding tasks. Once the jobs were done, it'd remove the excess containers. You'd be paying the "overhead" for Django in each container, but it could cost you less than an always-on Celery container and would certainly be simpler: you leverage your Celery experience and avoid the extra layer of event handling.
EDIT: Another disadvantage of this version is that you need to run Redis somewhere and I've found the minimum cost for this to be relatively high.
Based on my growing AWS experience, here's what you probably should do...
Use AWS API Gateway as an always-on receiver of events/requests. You only pay for requests; the free tier includes a million per month, and the next 300M are about $1 per million (pricing), so this is likely to be free.
While you have many options for responding to the request, an AWS Lambda function (which can be written in python) should have the least overhead.
If your queue will run longer than a Lambda function allows (15 minutes), you'll need to have that Lambda function delegate the processing to e.g. a Fargate task.
(Optional) If you want to use a Docker Hub container for your Fargate task: we experienced a bunch of issues with Tasks and Services failing to start due to rate limits at Docker Hub. We ended up wrapping our Fargate task in a Step Function that checked for this error specifically and retried.
(Optional) If you need to limit concurrency, this SO answer suggests having your Lambda function check for an existing execution (of a Step Function or Fargate task). I was hoping there was something native on Fargate Tasks or Step Functions but I don't see anything.
I imagine this would represent a huge operating cost savings over the always-on Fargate task and Elasticache Redis queue, but the up-front cost/hassle could exceed the savings.
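For the concurrency check in the last optional point, the Lambda could do something like this before kicking anything off (a sketch; the cluster and task family names are placeholders):

```python
# Sketch: skip launching a new Fargate task if one from the same family is already running.
# Cluster and task family names are placeholders.
import boto3

ecs = boto3.client("ecs")

def already_running(cluster="worker-cluster", family="queue-worker") -> bool:
    running = ecs.list_tasks(cluster=cluster, family=family, desiredStatus="RUNNING")
    return len(running["taskArns"]) > 0

def handler(event, context):
    if already_running():
        return {"statusCode": 429, "body": "a worker task is already running"}
    # otherwise start the Fargate task / Step Function execution here
    return {"statusCode": 202}
```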
Have you thought of using AWS Lambda instead of the celery worker? You would then pay per task execution, where cost is driven by execution time and memory usage. If you have an application which is mostly idle then paying per request, skipping the idle cost, would make the most sense.
The core of my question is whether or not there are downsides to using an Amazon Machine Image + Micro Spot instances to run a task, vs using the Elastic Container Service (ECS).
Here's my situation: I have the need to run a task on demand that is triggered by a remote web hook.
There is the possibility this task can get triggered 10 times in a row, or go weeks w/o ever executing, so I definitely want a service that only runs (and bills) on demand.
My plan is to point the webhook to a Lambda function, but then the question is what to have the Lambda function do.
Though it doesn't take very long, this task requires several different runtimes (PowerShell Core, Python, PHP, Git) to get its job done, so Lambda isn't really a possibility as I'd hit the deployment package size limit. But I can use Lambda to kick off the job.
What I started doing was creating an AMI that has all the necessary runtimes and code, then using a Spot request to launch an instance, have it execute the operation via a startup script passed in via userdata, then shut itself down when it's done. I'd have to put in some rate control logic to prevent two from running at once, but that's a solvable problem.
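Roughly what the Lambda trigger looks like so far (a sketch; the AMI, instance type, and userdata script are placeholders, and it assumes the AMI shuts itself down when the work finishes):

```python
# Sketch: Lambda handler that launches a short-lived spot instance from a prebaked AMI.
# AMI ID and the userdata script below are placeholders.
import boto3

ec2 = boto3.client("ec2")

USERDATA = """#!/bin/bash
/opt/myjob/run-task.sh        # placeholder for the task orchestration script
shutdown -h now               # instance ends itself when the work is done
"""

def handler(event, context):
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",
        InstanceType="t3a.micro",
        MinCount=1,
        MaxCount=1,
        UserData=USERDATA,
        InstanceMarketOptions={"MarketType": "spot"},
        # Terminate (and stop billing) when the instance shuts itself down.
        InstanceInitiatedShutdownBehavior="terminate",
    )
```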
I hesitated half way through developing this solution when I realized I could probably do this with a docker container on ECS using Fargate.
I just don't know if there is any benefit of putting in the additional development time of switching to a docker container, when I am not a docker pro and already have the AMI configured. Plus ECS/Fargate is actually more expensive than just running a micro instance.
Are there any concerns about spinning up short-lived (<5 min) spot requests (t3a.micro) where there could be a dozen fired off in a single day? Are there rate limits for this? Will I get an angry email from AWS telling me to knock it off? Are there other reasons ECS is the only right answer? Something else entirely?
Your solution using spot instance and AMI is a valid one, though I've experienced slow times to get a spot instance in the past. You also incur the AMI startup time.
As mentioned in the comments, you will incur a minimum one-hour charge for the instance, so you should leave your instance up for the hour before terminating, in case more requests come in within the same hour.
IMHO you should build it all with Lambda. By splitting the workload for each runtime into its own Lambda function, you can make it work.
AWS supports Python and PowerShell runtimes, and you can create a custom PHP one. Chain them together with your glue of choice (SNS, SQS, direct invocation, or Step Functions) and you have the most cost-effective solution. You also get the benefit of better, independent maintenance for each function/runtime.
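Chaining by direct invocation, for example, is one call per hop (a sketch; the function name and payload shape are placeholders):

```python
# Sketch: the Python Lambda hands off to the PowerShell Lambda asynchronously.
# Function name and payload shape are placeholders.
import json

import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # ... do the Python part of the job here ...
    lambda_client.invoke(
        FunctionName="job-powershell-step",
        InvocationType="Event",            # async: fire and forget
        Payload=json.dumps({"job_id": event.get("job_id")}),
    )
```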
Put the initial Lambda behind API Gateway and you get rate-limiting capability too.
I am trying to create a certain kind of networking infrastructure, and have been looking at Amazon ECS and Kubernetes. However I am not quite sure if these systems do what I am actually seeking, or if I am contorting them to something else. If I could describe my task at hand, could someone please verify if Amazon ECS or Kubernetes actually will aid me in this effort, and this is the right way to think about it?
What I am trying to do is on-demand single-task processing on an AWS instance. What I mean by this is: I have a resource-heavy application which I want to run in the cloud to process a chunk of data submitted by a user. I want to submit this data to be processed by the application, have an EC2 instance spin up, process the data, upload the results to S3, and then shut down the EC2 instance.
I have already put together a functioning solution for this using Simple Queue Service, EC2 and Lambda. But I am wondering whether ECS or Kubernetes would make this simpler? I have been going through the ECS documentation and it seems like it is not very concerned with starting up and shutting down instances. It seems like it wants to have an instance that is constantly running, and then Docker images are fed to it as tasks to run. Can Amazon ECS be configured so that if there are no tasks running it automatically shuts down all instances?
Also I am not understanding how exactly I would submit a specific chunk of data to be processed. It seems like "Tasks" as defined in Amazon ECS really correspond to a single Docker container, not so much what kind of data that Docker container will process. Is that correct? So would I still need to feed the data-to-be-processed into the instances via simple queue service, or other? Then use Lambda to poll those queues to see if they should submit tasks to ECS?
This is my naive understanding of this right now, if anyone could help me understand the things I've described better, or point me to better ways of thinking about this it would be appreciated.
This is a complex subject and many details for a good answer depend on the exact requirements of your domain / system. So the following information is based on the very high level description you gave.
A lot of the features of ECS, kubernetes etc. are geared towards allowing a distributed application that acts as a single service and is horizontally scalable, upgradeable and maintainable. This means it helps with unifying service interfacing, load balancing, service reliability, zero-downtime maintenance, scaling the number of worker nodes up/down based on demand (or other metrics), etc.
The following describes a high level idea for a solution for your use case with kubernetes (which is a bit more versatile than AWS ECS).
So for your use case you could set up a kubernetes cluster that runs a distributed event queue, for example an Apache Pulsar cluster, as well as an application cluster that is being sent queue events for processing. Your application cluster size could scale automatically with the number of unprocessed events in the queue (custom pod autoscaler). The cluster infrastructure would be configured to scale automatically based on the number of scheduled pods (pods reserve capacity on the infrastructure).
You would have to make sure your application can run in a stateless form in a container.
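As a rough illustration of the "custom pod autoscaler" idea (a sketch with the Kubernetes Python client; the backlog function, deployment name, and scaling rule are all placeholders, and in practice you might reach for an existing autoscaler instead):

```python
# Sketch: naive autoscaling loop that sizes the worker Deployment from the queue backlog.
# queue_backlog(), the deployment name, and the scaling rule are placeholders.
import time

from kubernetes import client, config

def queue_backlog() -> int:
    """Placeholder: return the number of unprocessed events in the queue."""
    raise NotImplementedError

def main():
    config.load_kube_config()   # or config.load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    while True:
        replicas = min(max(queue_backlog() // 10, 1), 20)   # keep 1..20 workers
        apps.patch_namespaced_deployment_scale(
            name="event-processor",
            namespace="default",
            body={"spec": {"replicas": replicas}},
        )
        time.sleep(60)
```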
The main benefit I see over your current solution would be cloud provider independence, as well as some general benefits of running a containerized system: 1. not having to worry about the exact setup of your EC2 instances in terms of the operating system dependencies of your workload; 2. being able to address the processing application as a single service; 3. potentially increased reliability, for example in case of errors.
Regarding your exact questions:
Can Amazon ECS be configured so that if there are no tasks running it automatically shuts down all instances?
The keyword here is autoscaling. Note that there are two levels of scaling: 1. infrastructure scaling (the number of EC2 instances) and 2. application service scaling (the number of application containers/tasks deployed). ECS infrastructure scaling works based on EC2 auto scaling groups. For more info see this link. For application service scaling and serverless ECS (Fargate) see this link.
Also I am not understanding how exactly I would submit a specific chunk of data to be processed. It seems like "Tasks" as defined in Amazon ECS really correspond to a single Docker container, not so much what kind of data that Docker container will process. Is that correct?
A "Task Definition" in ECS is describing how one or multiple docker containers can be deployed for a purpose and what its environment / limits should be. A task is a single instance that is run in a "Service" which itself can deploy a single or multiple tasks. Similar concepts are Pod and Service/Deployment in kubernetes.
So would I still need to feed the data-to-be-processed into the instances via Simple Queue Service, or other? Then use Lambda to poll those queues to see if they should submit tasks to ECS?
A queue is always helpful in decoupling the service requests from processing and making sure you don't lose requests. It is not required if your application service cluster can offer a service interface and process incoming requests directly in a reliable fashion. But if your application cluster has to scale up/down frequently, that may impact its ability to process reliably.