The goal is to migrate our jobs from Control-M to AWS, but before I do that I want to better understand the differences between AWS Batch and AWS Step Functions. From what I've understood, AWS Step Functions seems more encompassing, in that one of my steps can run an AWS Batch job.
Can you explain the difference between AWS Batch and AWS Step Functions? Which is better suited as a migration target from Control-M? (Maybe this is preference.)
AWS Batch is a service for running offline (batch) workloads. With Batch, you can easily set up your offline workload using Docker, defining the instance types to use and how many instances will run the workload.
AWS Step Functions is a serverless workflow management service. It only gives you a way to connect other AWS services together; you cannot run a script inside Step Functions itself, you only define the workflow and the input/output passed between the AWS services it calls.
That said, you can use both services to migrate from Control-M to AWS, possibly together with other AWS services such as Lambda (for small workloads), SNS (for email), and S3 (for storage).
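To make the relationship concrete, here is a minimal sketch (Python with boto3; the role ARN, queue and job definition names are placeholders) of a Step Functions state machine whose single state submits an AWS Batch job and waits for it to complete:

```python
import json
import boto3

# Placeholder names/ARNs -- replace with your own resources.
STATE_MACHINE_ROLE_ARN = "arn:aws:iam::123456789012:role/StepFunctionsBatchRole"
JOB_QUEUE = "my-job-queue"
JOB_DEFINITION = "my-job-definition"

# A one-step workflow: Step Functions submits an AWS Batch job and waits
# for it to finish (the ".sync" integration) before succeeding or failing.
definition = {
    "Comment": "Run a containerized Control-M-style job on AWS Batch",
    "StartAt": "RunBatchJob",
    "States": {
        "RunBatchJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "nightly-job",
                "JobQueue": JOB_QUEUE,
                "JobDefinition": JOB_DEFINITION,
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="control-m-migration-demo",
    definition=json.dumps(definition),
    roleArn=STATE_MACHINE_ROLE_ARN,
)
```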
They seem to serve the same purpose: both can be broken down into steps, with each step being a script.
Both Command and Automation documents can also be part of SSM associations in State Manager.
So my question is simple: in which case would I need to create a Command document instead of an Automation document?
From the documentation:
Using Run Command, a capability of AWS Systems Manager, you can
remotely and securely manage the configuration of your managed nodes.
So with Command documents you are executing commands on your managed instances (e.g. yum update).
Automation, a capability of AWS Systems Manager, simplifies common
maintenance, deployment, and remediation tasks for AWS services like
Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database
Service (Amazon RDS), Amazon Redshift, Amazon Simple Storage Service
(Amazon S3), and many more.
With Automation documents you can interact with AWS services to execute actions (e.g. launch an EC2 instance, create an AMI from a running instance, create an RDS snapshot, etc.).
Moreover, you can define retries and create branches in the process (e.g. when a step fails, follow a different path than when it succeeds).
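A quick way to see the difference is how each document type is invoked. Here is a minimal sketch (Python with boto3; the instance ID is a placeholder, and it assumes the AWS-managed documents AWS-RunShellScript and AWS-CreateImage):

```python
import boto3

ssm = boto3.client("ssm")

# Command document: runs a shell command *on* the managed instance itself.
ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],  # placeholder instance ID
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["sudo yum -y update"]},
)

# Automation document: calls AWS APIs *about* resources (here: create an AMI
# from a running instance).
ssm.start_automation_execution(
    DocumentName="AWS-CreateImage",
    Parameters={"InstanceId": ["i-0123456789abcdef0"]},
)
```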
I have a Docker image. I would like to create a container periodically and run it as a job, say every 1 hour, by creating a CloudWatch rule.
As we are using the AWS cloud, I am looking at the AWS Batch service. Interestingly, there is also an ECS scheduled task.
What is the difference between these 2?
Note: I have an init container, that is, I have 2 Docker containers to run one after another. This seems to be possible with an ECS scheduled task, but not with Batch.
AWS Batch is for batch jobs, such as processing numerous images or videos in parallel (one container per image/video). It is mostly useful for batch-type workloads, e.g. for research purposes.
AWS Batch is built on top of ECS (which also supports EC2), whereas ECS simply runs your containers: it does not target a specific use case and is more generic. If you don't have batch-type projects, then ECS would probably be the better choice for you.
The other answers are spot on. I just wanted to add that we (the AWS container team) ran a session at re:Invent last year that covered these options and provided hints about when to use one over the other. The session covers the relationship between ECS, EC2 and Fargate (something that is often missed), as well as when to use "raw" ECS vs Step Functions vs Batch as an entry point for running your batch jobs. This is the link to the session.
If you want to run two containers in sequence, using AWS Fargate, then you probably want to orchestrate it with AWS Step Functions. Step Functions will allow you to call arbitrary tasks in serial, and it has direct integration with AWS Fargate.
Amazon EventBridge Rule (hourly) ----- uses AWS IAM role to gain permission to trigger Step Functions
|
| triggers
|
AWS Step Functions ----- Uses AWS IAM role to gain permission to trigger Fargate
|
| triggers
|
AWS Fargate (Amazon ECS) Task Definition
AWS Batch is designed for data processing tasks that need to scale out across many nodes. If your use case is simply to spin up a couple of containers in sequence, then AWS Batch will be overkill.
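To illustrate the diagram above, here is a minimal sketch (Python; the cluster ARN and task definition names are placeholders) of an Amazon States Language definition that runs two Fargate tasks back to back, which is one way to handle the "init container, then main container" requirement:

```python
import json

# Two sequential ECS/Fargate tasks: an "init" task followed by the main job.
# The ".sync" integration makes Step Functions wait for each task to finish.
# NOTE: Fargate also requires a NetworkConfiguration (subnets/security groups),
# omitted here for brevity.
definition = {
    "StartAt": "InitTask",
    "States": {
        "InitTask": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/demo",
                "TaskDefinition": "init-task:1",
            },
            "Next": "MainTask",
        },
        "MainTask": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/demo",
                "TaskDefinition": "main-task:1",
            },
            "End": True,
        },
    },
}

# Paste the output into the Step Functions console or pass it to create_state_machine.
print(json.dumps(definition, indent=2))
```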
CloudWatch Event Rules
FYI CloudWatch Event Rules still work, but the service has been rebranded as Amazon EventBridge. I'd recommend using the Amazon EventBridge console and APIs instead of Amazon CloudWatch Events APIs going forward.
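For the hourly trigger at the top of the diagram, a minimal sketch using the EventBridge API (Python with boto3; rule name, state machine ARN and role ARN are placeholders):

```python
import boto3

events = boto3.client("events")

# Hourly schedule.
events.put_rule(
    Name="hourly-pipeline",
    ScheduleExpression="rate(1 hour)",
)

# Point the rule at the Step Functions state machine; RoleArn must be an IAM
# role that EventBridge can assume to start executions.
events.put_targets(
    Rule="hourly-pipeline",
    Targets=[
        {
            "Id": "start-state-machine",
            "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:demo",
            "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
        }
    ],
)
```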
This is going to be a fairly general question. I have a pipeline that I would like to execute in real time. The pipeline can have sudden and unpredictable load changes, so scalability (both up and down) is important. The pipeline stages can be packaged as Docker containers, though they don't necessarily start that way.
I see three ways to build said pipeline on AWS. 1) I can write an Airflow DAG and use Amazon Managed Workflows for Apache Airflow (MWAA). 2) I can write an AWS Lambda pipeline with AWS Step Functions. 3) I can write a Kubeflow pipeline on top of AWS EKS.
These three options have different ramifications in terms of cost and scalability, I would presume. E.g., scaling a Kubernetes cluster in AWS EKS will be a lot slower than scaling Lambda functions, assuming I don't hit the service quota for Lambdas. Can someone comment on the scalability of AWS managed Airflow? Does it scale faster than EKS? How does it compare to AWS Lambda?
Why not use Airflow to orchestrate the entire pipeline? Airflow can certainly invoke a Step Function using the StepFunctionStartExecutionOperator or by writing a custom Python function to do the same with the PythonOperator.
Seems like this solution would be the best of both worlds: true data orchestration, monitoring, and alerting in Airflow (while keeping a fairly light Airflow instance since it's pure orchestration) with the scalability and responsiveness in AWS Lambda.
I've used this method for a very similar use case in the past and it worked like a charm. Plus, if you need to scale this pipeline to integrate with other services and systems in the future, Airflow gives you that flexibility because it's an orchestrator and is system- and provider-agnostic.
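A minimal sketch of what that looks like as an Airflow DAG (the state machine ARN is a placeholder, and the import path assumes a reasonably recent amazon provider package):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.step_function import (
    StepFunctionStartExecutionOperator,
)

# Placeholder ARN -- replace with your own state machine.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:pipeline"

with DAG(
    dag_id="trigger_step_function_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered on demand; swap in a cron string if needed
    catchup=False,
) as dag:
    # Airflow stays a thin orchestrator: it just starts the Step Functions
    # execution and lets Lambda/Fargate do the heavy lifting.
    start_pipeline = StepFunctionStartExecutionOperator(
        task_id="start_pipeline",
        state_machine_arn=STATE_MACHINE_ARN,
        state_machine_input={"run_date": "{{ ds }}"},  # pass Airflow context downstream
    )
```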
I made a classifier in Python that uses a lot of libraries. I have uploaded the model to Amazon S3 as a pickle (my_model.pkl). Ideally, every time someone uploads a file to a specific S3 bucket, it should trigger an AWS Lambda that would load the classifier, return predictions and save a few files on an Amazon S3 bucket.
I want to know if it is possible to use a Lambda function to execute a Jupyter Notebook in AWS SageMaker. That way I would not have to worry about the dependencies, and it would generally make the classification more straightforward.
So, is there a way to use an AWS Lambda to execute a Jupyter Notebook?
Scheduling notebook execution is a bit of a SageMaker anti-pattern, because (1) you would need to manage data I/O (training set, trained model) yourself, (2) you would need to manage metadata tracking yourself, (3) you cannot run on distributed hardware, and (4) you cannot use Spot. Instead, for scheduled tasks it is recommended to leverage the various SageMaker long-running, background job APIs: SageMaker Training, SageMaker Processing, or SageMaker Batch Transform (in the case of batch inference).
That being said, if you still want to schedule a notebook to run, you can do it in a variety of ways:
In the SageMaker CI/CD re:Invent 2018 video, notebooks are launched as CloudFormation templates, and their execution is automated via a SageMaker lifecycle configuration.
AWS released this blog post to document how to launch Notebooks from within Processing jobs
But again, my recommendation for scheduled tasks would be to remove them from Jupyter, turn them into scripts, and run them in SageMaker Training.
No matter your choice, all those tasks can be launched as API calls from within a Lambda function, as long as the function's role has the appropriate permissions.
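As an illustration of that last point, here is a minimal sketch of a Lambda handler that starts a SageMaker Training job (Python with boto3; all names, ARNs, the training image and S3 paths are placeholders):

```python
import time

import boto3

sm = boto3.client("sagemaker")


def handler(event, context):
    """Hypothetical handler: kick off a SageMaker Training job, e.g. when an
    object lands in the S3 bucket that triggers this Lambda."""
    job_name = f"classifier-train-{int(time.time())}"
    sm.create_training_job(
        TrainingJobName=job_name,
        RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        AlgorithmSpecification={
            # Placeholder training container image in ECR.
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
            "TrainingInputMode": "File",
        },
        InputDataConfig=[
            {
                "ChannelName": "training",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://my-bucket/training-data/",
                    }
                },
            }
        ],
        OutputDataConfig={"S3OutputPath": "s3://my-bucket/model-artifacts/"},
        ResourceConfig={
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )
    return {"training_job_name": job_name}
```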
I agree with Olivier. Using SageMaker for notebook execution might not be the right tool for the job.
Papermill is a framework for running Jupyter Notebooks in this fashion.
You can consider trying this. This allows you to deploy your Jupyter Notebook directly as serverless cloud function and uses Papermill behind the scene.
Disclaimer: I work for Clouderizer.
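For reference, executing a notebook with Papermill is essentially a single call; a minimal sketch (the notebook paths and parameter names are placeholders):

```python
import papermill as pm

# Execute a parameterized notebook and write the executed copy (with outputs)
# back to S3.
pm.execute_notebook(
    "s3://my-bucket/notebooks/classify.ipynb",              # input notebook
    "s3://my-bucket/notebooks/output/classify-run.ipynb",   # executed output notebook
    parameters={"input_key": "uploads/new-file.csv"},       # injected into the notebook
)
```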
It's totally possible, and not an anti-pattern at all; it really depends on your use case. AWS actually wrote a great article describing it, which includes a Lambda.
I am very new to AWS, and when I was searching for something to download code from GitHub (a Python project), run it, and save the output to S3, the first service I found was CodeBuild.
So I implemented this kind of workflow using CodeBuild.
But now I have seen that AWS has a service called AWS Batch, and I am wondering if I should migrate my architecture to AWS Batch.
Can you explain which one - AWS CodeBuild or AWS Batch - is more suitable for my case? When should I use AWS Batch instead of AWS CodeBuild?
Thanks very much.
TL;DR: AWS CodeBuild is the nicer choice for simple jobs.
My (reverse) experience...
I needed to run a simple job that pulls data from an external API, reads/writes to an external database, and generates a CSV report.
The job takes ~1 hour to run, so AWS Lambda is out of the picture.
After some googling, I found AWS Batch and decided to give Creating a Simple “Fetch & Run” AWS Batch Job a try.
The required steps to get this "simple" job working (a boto3 sketch of the final submission step follows the list):
Build a Docker image with the fetch & run script
Create an Amazon ECR repository for the image
Push the built image to ECR
Create a simple job script and upload it to S3
Create an IAM role to be used by jobs to access S3
Configure a compute environment
Create a job queue
Create a job definition that uses the built image
Submit and run a job that executes the job script from S3
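For completeness, that final submission step looks roughly like this (Python with boto3; the queue, job definition and bucket names are placeholders, and the environment variables assume the fetch & run entrypoint from the linked article):

```python
import boto3

batch = boto3.client("batch")

# Submit a job against the queue and job definition created in the steps above.
batch.submit_job(
    jobName="fetch-and-run-report",
    jobQueue="simple-job-queue",
    jobDefinition="fetch-and-run:1",
    containerOverrides={
        # Tell the fetch & run entrypoint which script to pull from S3 and execute.
        "environment": [
            {"name": "BATCH_FILE_TYPE", "value": "script"},
            {"name": "BATCH_FILE_S3_URL", "value": "s3://my-bucket/jobs/report.sh"},
        ],
    },
)
```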
After spending the time to create all these resources, it did not work out of the box. I found myself debugging random things I shouldn't have to debug such as:
Dockerfile
entrypoint script
ECS cluster
EC2 instance and autoscaling group
After failing to find simple practical examples, and realizing the amount of effort required, I decided to explore other solutions.
I stumbled onto Using AWS CodeBuild to execute administrative tasks and this post.
I've used AWS CodeBuild in the past for CI/CD pipelines, and thought "what the hell, let's give it a try". In a shorter amount of time, and with less effort, I was able to get a "CodeBuild job" running on a CloudWatch scheduler, with CodeBuild Slack notifications added (a boto3 sketch of kicking off such a build follows the list):
Connect build project to your source code
Select a runtime environment
Create IAM role
Create a buildspec.yml and add runtime commands
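Kicking the job off on demand is then a single API call; a minimal sketch (Python with boto3; the project name is a placeholder):

```python
import boto3

codebuild = boto3.client("codebuild")

# Start a build of the project configured in the steps above. The hourly
# trigger itself is an EventBridge (CloudWatch Events) rule targeting the
# CodeBuild project, but the same call works from a script or a Lambda.
codebuild.start_build(projectName="hourly-report-job")
```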
One major advantage is that CodeBuild runs tasks in a full-blown Linux environment.
Drawbacks:
Max execution time of 8 hours
AWS CodeBuild was much easier to get working for my simple job.
Sorry for the long post, just wanted to share my experience with these 2 services.
AWS Batch is intended for highly parallel computations, e.g., processing a large number of images at the same time:
AWS Batch enables you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources, and AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software.
Thus it's not suited for what you are trying to do. CodeBuild is the better choice, based on your description.