How to run resource-intensive tasks with Airflow - amazon-web-services

We have a long-running (3 h) model-training task which runs every 3 days, and smaller prediction pipelines that run daily.
For both cases we use Jenkins + the EC2 plugin to spin up large instances (workers) and run the pipelines on them. This serves two purposes:
Keep pipelines isolated, so every pipeline has the full resources of one instance.
Save costs: the large instances run only for several hours, not 24/7.
With Jenkins + the EC2 plugin I am not responsible for copying code to the worker or reporting the result of the execution back; Jenkins does it under the hood.
Is there any way to achieve the same behaviour with Airflow?

Airflow 1.10 shipped a host of new AWS integrations that give you a few options for doing something like this on AWS.
https://airflow.apache.org/integration.html#aws-amazon-web-services
If you are running your tasks in a containerized setting, it sounds like the ECSOperator could be what you need, or the KubernetesPodOperator if you're using Kubernetes.
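For example, a heavy training step could be offloaded to ECS/Fargate with something along these lines. This is only a rough sketch: the cluster name, task definition, container name and subnet are placeholders, and the exact import path and supported parameters depend on your Airflow release.

```python
# Sketch of a DAG that offloads a long-running training job to ECS (Fargate),
# so no large worker has to run 24/7. Names and IDs below are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator  # Airflow 1.10 path

default_args = {"owner": "airflow", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="model_training",
    default_args=default_args,
    schedule_interval="0 2 */3 * *",   # roughly every 3 days at 02:00
    start_date=datetime(2019, 1, 1),
    catchup=False,
) as dag:

    train = ECSOperator(
        task_id="train_model",
        cluster="training-cluster",          # hypothetical ECS cluster
        task_definition="model-training",    # hypothetical task definition
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "trainer", "command": ["python", "train.py"]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
        aws_conn_id="aws_default",
        region_name="us-east-1",
    )
```

The KubernetesPodOperator works the same way conceptually: each task gets its own pod with dedicated resources that only exist while the task runs.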

Related

AWS Managed Apache Airflow (MWAA) or AWS Batch for a simple batch job flow

I have a simple workflow to design where 4 batch jobs run one after another sequentially, and each job runs in a multi-node master/slave architecture.
AWS Batch can manage a simple workflow using a job queue, and it can manage multi-node parallel jobs as well.
Now, should I use AWS Batch or Airflow?
With Airflow, I can use the KubernetesPodOperator and the job will run in a Kubernetes cluster, but Airflow does not inherently support multi-node parallel jobs.
Note: the batch jobs are written in Java using the Spring Batch remote partitioning framework, which supports a master/slave architecture.
AWS Batch would fit your requirements better.
Airflow is a workflow orchestration tool; it's used to host many jobs that have multiple tasks each, with each task being light on processing. Its most common use is for ETL, but in your use case you would have an entire Airflow ecosystem for just a single job, which (unless you manually broke it out into smaller tasks) would not run multi-threaded.
AWS Batch, on the other hand, is built for batch processing, and you can more finely tune the servers/nodes you want your code to execute on. In your use case I think it would also work out cheaper than Airflow.
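If you go the AWS Batch route, the sequential part of the workflow can be expressed with job dependencies at submit time. A minimal sketch with boto3, assuming the queue and the (multi-node) job definition names below are placeholders:

```python
# Chain 4 AWS Batch jobs sequentially using dependsOn, so each job starts
# only after the previous one has finished. Queue/definition names are placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

job_definitions = ["step-1", "step-2", "step-3", "step-4"]  # hypothetical job definitions
previous_job_id = None

for name in job_definitions:
    kwargs = {
        "jobName": f"workflow-{name}",
        "jobQueue": "my-batch-queue",   # hypothetical job queue
        "jobDefinition": name,
    }
    if previous_job_id:
        # This job waits for the previous one to complete
        kwargs["dependsOn"] = [{"jobId": previous_job_id}]
    response = batch.submit_job(**kwargs)
    previous_job_id = response["jobId"]
    print(f"Submitted {name}: {previous_job_id}")
```

The multi-node (master/slave) configuration itself lives in each job definition's node properties, so the orchestration code stays this small.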

Schedule Docker image to be run periodically on AWS ECS?

How do I schedule a Docker image to be run periodically (hourly) using ECS, without having to use a continually running EC2 instance + cron? I have a Docker image containing third-party binaries and the Python project.
The latter approach is not viable long-term, as it's expensive to keep the instance running 24/7 while it's only used for a small fraction of the day; each invocation of the script lasts only ~3 minutes.
For an AWS ECS cluster, it is recommended to have at least one EC2 server running 24x7. Have you looked at whether AWS Fargate can run your Docker container? Or AWS Batch? If Fargate and AWS Batch are not possible, then for your requirement I would recommend something like this, without ECS:
Build an EC2 AMI with Docker and the required software and libraries pre-installed.
Have AWS Instance Scheduler spin up an EC2 server every hour and, as part of the user data, start a Docker container with the image you mentioned.
https://aws.amazon.com/answers/infrastructure-management/instance-scheduler/
If you know your task execution time (say about 5 minutes), bring the server down with the scheduler after 8 or 10 minutes.
The above approach will blindly start an EC2 instance and stop it without knowing whether your Python work finished successfully. It can still be improved with a combination of Lambda and CloudFormation templates. Let me know your thoughts :)
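One way to avoid blindly stopping the server is to let the instance shut itself down once the container exits. A rough boto3 sketch of that idea, assuming the AMI already has Docker installed; the AMI ID, instance type and image name are placeholders:

```python
# Launch an EC2 instance that runs the container and then terminates itself.
# InstanceInitiatedShutdownBehavior='terminate' makes the in-instance shutdown
# call terminate the instance once the work is done.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

user_data = """#!/bin/bash
docker run --rm my-registry/my-image:latest   # hypothetical image
shutdown -h now
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # AMI with Docker pre-installed
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    InstanceInitiatedShutdownBehavior="terminate",
)
```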
Actually, it's possible to schedule the launch directly in CloudWatch by defining a rule, as explained in
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduled_tasks.html
This solution is cleaner, because you will not need to worry about the execution time: once finished, the task will just terminate and a new one will be spawned on the next cycle.
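For reference, the same scheduled task can also be set up programmatically with boto3 by creating a CloudWatch Events rule that targets the ECS task definition. All ARNs, names and the subnet below are placeholders:

```python
# Create an hourly CloudWatch Events (EventBridge) rule that runs an ECS task
# definition on Fargate, so no EC2 instance has to run 24/7.
import boto3

events = boto3.client("events", region_name="us-east-1")

events.put_rule(
    Name="hourly-docker-task",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

events.put_targets(
    Rule="hourly-docker-task",
    Targets=[
        {
            "Id": "run-my-task",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/my-task:1",
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
        }
    ],
)
```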

Parallel processing with load balancing on AWS

I have the use case below and need some help figuring out the best options on AWS.
I have a Python script which needs to be executed for 200 different datasets.
I need to run each dataset on an AWS instance. The maximum number of instances I can have is 10 (so I need 20 rounds of 10 instances running in parallel to complete my 200 jobs).
All the instances will use a common MongoDB instance to store/read data for the Python scripts.
This is not a web application, just a simple Python script invocation.
The Python script won't provide any exit code once it's completed (it's a third-party script and I don't have control over it), so I need to figure out when an AWS instance has completed its job so I can send the next dataset for processing (a kind of load balancing).
Sounds like a typical use case for SQS, a distributed queue.
An Auto Scaling Group managing the EC2 instances
An SQS queue managing the calculation jobs
A small script polling new jobs from SQS and executing the Python script (a sketch is at the end of this answer)
CloudWatch alarms scaling the Auto Scaling Group up and down based on the number of jobs in the SQS queue
General approach: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-using-sqs-queue.html
Using PaaS Elastic Beanstalk for this kind of setup: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
Example implementation: https://cloudonaut.io/antivirus-for-s3-buckets/
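The polling script mentioned above could be as small as the following sketch; the queue URL and script path are placeholders, and since the third-party script reports no exit code, the message is simply deleted once the process has ended:

```python
# Each EC2 worker runs this loop: pull one dataset name from SQS, run the
# third-party script on it, then delete the message so the job is not repeated.
import subprocess
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dataset-jobs"  # placeholder

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,        # long polling
        VisibilityTimeout=3600,    # hide the message while the job runs
    )
    for msg in resp.get("Messages", []):
        dataset = msg["Body"]
        # The process ending is used as the "done" signal, since the script
        # itself provides no meaningful exit code.
        subprocess.run(["python", "process.py", dataset])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

The CloudWatch alarms then only need to watch the ApproximateNumberOfMessagesVisible metric of the queue to decide when to scale the group up or down.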

Amazon web service Data pipeline

Can we use existing EC2 instance details while configuring Data Pipeline? If it is possible, what are the EC2 details that we need to provide while creating a pipeline?
Yes, it is possible, according to AWS support:
"You can install Task Runner on computational resources that you manage, such as an Amazon EC2 instance, or a physical server or workstation. Task Runner can be installed anywhere, on any compatible hardware or operating system, provided that it can communicate with the AWS Data Pipeline web service.
This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization’s firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. The Task Runner logs persist after pipeline execution is complete."
I did this myself, as it takes a while to get the pipeline to start up; this start-up time could be 10-15 minutes depending on unknown factors.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html
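In practice, the piece that ties your own EC2 instance to the pipeline is the worker group: the activity carries a workerGroup field, and the Task Runner you start on that instance is launched with the same worker group value. A hedged boto3 sketch, where the pipeline name, command and worker group are placeholders:

```python
# Define a pipeline whose activity is routed to a self-managed Task Runner
# (running on an existing EC2 instance) via a shared workerGroup value.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(name="my-pipeline", uniqueId="my-pipeline-001")["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [{"key": "scheduleType", "stringValue": "ondemand"}],
        },
        {
            "id": "RunScript",
            "name": "RunScript",
            "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "python /opt/jobs/run.py"},
                # Targets your own EC2 instance: the Task Runner started there
                # must use the same --workerGroup value.
                {"key": "workerGroup", "stringValue": "my-existing-ec2"},
            ],
        },
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```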

Mesos, Marathon, the cloud and 10 data centers - How to talk to each other?

I've been looking into the Mesos, Marathon and Chronos combo to host a large number of websites. In my head I should be able to type a few commands into my laptop and wait about 30 minutes for the thing to build and deploy.
My only issue is that my resources are scattered across multiple data centers, numerous cloud accounts, and about 6 on-premises locations. I see no reason why I can't control them all from my laptop -- (I have serious power and control issues when it comes to my hardware!)
I'm thinking that my best approach is to build the brains in the cloud (ZooKeeper and at least one master) and then add on the separate data centers, but I have yet to see any examples of a distributed cluster where not all the nodes can talk to each other.
Can anyone recommend a way of doing this?
I've got a setup like this that I'd like to recommend:
Source code, deployment scripts and Dockerfiles in Git
Each web service has its own directory and comes with a Dockerfile to containerize it
A build script (a shell script running docker builds) builds all the Docker containers, and all the images are pushed to a Docker image repository
An Ansible deploy deploys all the containers remotely to a set of VPSes (you would use your own deployment procedure that fits Mesos/Marathon)
As part of the process, an ActiveMQ broker is deployed to the cloud (yep, in a container). While deploying, it supplies each node with the URL of the broker it needs to connect to. In your setup you could instead use ZooKeeper or etcd, for example.
I am also using Jenkins to do automatic rebuilds and run deploys whenever there are Git commits, but they can also be done manually.
Rebuilds are lightning fast, and deploys don't take much time either. I can replicate everything I have in my repository endlessly and have zero configuration.
To be able to do a new deploy, all I need is a set of VPSes with Docker daemons and some datastores for persistence. I'm not sure if this is something that you can replace with Mesos, but Ansible will definitely be able to install a Mesos cloud for you onto your hardware.
All logging is done with Logstash, to a central logging server.
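For comparison, handing the same container image to Marathon instead of plain VPSes is just a POST of an app definition to its REST API. A small sketch with the requests library; the Marathon URL, app id and image are placeholders:

```python
# Submit a Dockerized app to Marathon; Mesos then decides which node runs it.
import requests

app = {
    "id": "/websites/my-site",
    "instances": 2,
    "cpus": 0.5,
    "mem": 256,
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "registry.example.com/my-site:latest",
            "network": "BRIDGE",
            "portMappings": [{"containerPort": 80, "hostPort": 0}],
        },
    },
}

resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
resp.raise_for_status()
print(resp.json().get("deployments"))
```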
I have set up a 3-master, 5-slave, 1-gateway Mesos/Marathon/Docker cluster and documented it here:
https://github.com/debianmaster/Notes/wiki/Mesos-marathon-Docker-cluster-setup-on-RHEL-7-with-three-master
This may help you understand the load balancing/scaling across different machines in your data center:
1) Masters can also be used as slaves.
2) The Mesos HAProxy bridge script can be used for service discovery of newly created services in the cluster.
3) The gateway HAProxy is updated every minute with new services that are created.
The documentation covers:
1) Master/slave setup
2) Setting up HAProxy so that it reloads automatically
3) Setting up Docker
4) An example service program
You should use Terraform to orchestrate your infrastructure as code.
Terraform has a lot of providers that allow you to manage different resources across multiple cloud services and/or bare-metal resources such as vSphere.
You can start with the Getting Started Guide.