Running ML preprocessing job in AWS

My dear people,
I am running some processing jobs on files stored in S3, and I just need regular compute power without a GPU. I found that I can use SageMaker Processing jobs via the SageMaker SDK, and I can also do the exact same task with a Fargate-based AWS ECS task via the Python ECS SDK (boto3). When I compare the procedures, both look very similar:
Build a docker image
Push the image to ECR
Configure the job (SageMaker Processing or ECS/Fargate)
Run the task
The pricing models also seem really close, so I am wondering why the same platform has two services doing such a similar thing. Can anyone explain the motivation behind this, and which service to use when, if I don't need a GPU?
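For concreteness, here is roughly what the SageMaker side looks like with the SageMaker Python SDK; the image URI, role, script and bucket paths below are placeholders.

```python
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Rough sketch of a SageMaker Processing job on plain CPU instances,
# using a custom image already pushed to ECR (all names/ARNs are placeholders).
processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-preprocess:latest",
    command=["python3"],
    role="arn:aws:iam::<account>:role/MySageMakerExecutionRole",
    instance_type="ml.m5.xlarge",   # CPU instance, no GPU
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # the processing script to run inside the container
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```

Roughly speaking, SageMaker Processing copies the S3 inputs into the container and uploads the outputs for you, whereas with ECS/Fargate you would launch the same image with run_task and handle the S3 I/O in your own code.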

Related

Deployment to AWS ECS

I am trying to automate deployment to AWS ECS and couldn't find much information on how to do that, so I would like to see if there is any advice on what I can explore. Currently, we have an Azure DevOps pipeline that pushes the containerized image to ECR, and we then manually create the task definition in ECS and update the service afterwards. Is there any way I can automate this with an Azure DevOps release?
A bit open-ended for a Stack Overflow-style question, but the short answer is that there are a lot of AWS-native alternatives for this. This is an example that implements the blue/green pattern (it can be simplified with a more generic rolling-update deployment). If you are new to ECS, you probably want to consider using Copilot. This is an entry-level blog post that hints at how to deploy an application and build a pipeline for it.
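If you would rather script the last mile from your existing pipeline than adopt a full AWS-native one, the manual step you describe boils down to two boto3 calls; a minimal sketch, with family, cluster, service and image names as placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Register a new task definition revision that points at the freshly pushed image.
resp = ecs.register_task_definition(
    family="my-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::<account>:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "my-app",
        "image": "<account>.dkr.ecr.<region>.amazonaws.com/my-app:latest",
        "essential": True,
    }],
)
new_revision_arn = resp["taskDefinition"]["taskDefinitionArn"]

# Point the service at the new revision; ECS then performs a rolling deployment.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    taskDefinition=new_revision_arn,
)
```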

Can I improve my setup in AWS for running (machine learning) python scripts in a container when a file is uploaded to S3?

I have a working setup in AWS that looks something like:
The point is that whenever a file is uploaded to S3, it triggers a Lambda, which triggers a CodeBuild project. The CodeBuild project is based on a Docker image (stored in ECR) and needs to run a few bash commands, mainly executing Python files inside the Docker image. That actually works really well.
The files in S3 are updated approximately once a day and each execution in codebuild takes around 4 minutes.
I got the question of why I am not using Fargate/SageMaker (the scripts are basically machine learning retraining and predictions). I was just wondering whether there would be any advantages in using Fargate and/or SageMaker for this. Is it, for example, possible to use Fargate and execute bash commands inside the container when triggered?
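For reference, the trigger in this setup can be as small as a Lambda handler that starts the CodeBuild project; the project and variable names below are made up:

```python
import boto3

codebuild = boto3.client("codebuild")

def handler(event, context):
    # Start the CodeBuild project for every object created in the bucket,
    # passing the object location in as environment variables.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        codebuild.start_build(
            projectName="ml-retraining-project",
            environmentVariablesOverride=[
                {"name": "S3_BUCKET", "value": bucket, "type": "PLAINTEXT"},
                {"name": "S3_KEY", "value": key, "type": "PLAINTEXT"},
            ],
        )
```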
IIUC, you're wondering about the difference between CodeBuild and Fargate/SageMaker.
Price
Calculate the price of these three products using the links below:
Pricing Fargate
Pricing SageMaker
Pricing CodeBuild
As you said, you're using a Docker image as the main training tool, so Fargate is probably more suitable for your scenario.
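If you go the Fargate route, the same S3 trigger could launch a one-off task from that image instead of a build; a rough boto3 sketch, where the cluster, task definition, container name, subnet and security group are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # Run a one-off Fargate task; the command runs inside the container
    # built from the same image that CodeBuild currently uses.
    ecs.run_task(
        cluster="ml-cluster",
        launchType="FARGATE",
        taskDefinition="ml-retraining-task",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [{
                "name": "trainer",
                "command": ["bash", "-c", "python retrain.py && python predict.py"],
            }]
        },
    )
```

That also answers the bash question: whatever command you override here (or bake into the image's ENTRYPOINT) is what runs inside the container when the task starts.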

Use AWS Lambda to execute a jupyter notebook on AWS Sagemaker

I made a classifier in Python that uses a lot of libraries. I have uploaded the model to Amazon S3 as a pickle (my_model.pkl). Ideally, every time someone uploads a file to a specific S3 bucket, it should trigger an AWS Lambda that would load the classifier, return predictions and save a few files on an Amazon S3 bucket.
I want to know if it is possible to use a Lambda to execute a Jupyter Notebook in AWS SageMaker. That way I would not have to worry about the dependencies, and it would generally make the classification more straightforward.
So, is there a way to use an AWS Lambda to execute a Jupyter Notebook?
Scheduling notebook execution is a bit of a SageMaker anti-pattern, because (1) you would need to manage data I/O (training set, trained model) yourself, (2) you would need to manage metadata tracking yourself, (3) you cannot run on distributed hardware, and (4) you cannot use Spot. Instead, for scheduled tasks it is recommended to leverage the various SageMaker long-running, background job APIs: SageMaker Training, SageMaker Processing or SageMaker Batch Transform (in the case of batch inference).
That being said, if you still want to schedule a notebook to run, you can do it in a variety of ways:
In the SageMaker CI/CD re:Invent 2018 video, notebooks are launched as CloudFormation templates, and their execution is automated via a SageMaker lifecycle configuration.
AWS released this blog post to document how to launch Notebooks from within Processing jobs
But again, my recommendation for scheduled tasks would be to remove them from Jupyter, turn them into scripts, and run them in SageMaker Training.
Whichever you choose, all those tasks can be launched as API calls from within a Lambda function, as long as the function's role has the appropriate permissions.
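As a rough illustration of that last point, a Lambda can start a Processing job with a single boto3 call, assuming its role has sagemaker:CreateProcessingJob plus iam:PassRole for the execution role; the image, role ARN and entrypoint below are placeholders:

```python
import time
import boto3

sm = boto3.client("sagemaker")

def handler(event, context):
    # Kick off a background SageMaker Processing job and return immediately;
    # the job itself runs outside the Lambda's 15-minute limit.
    sm.create_processing_job(
        ProcessingJobName=f"classify-{int(time.time())}",
        RoleArn="arn:aws:iam::<account>:role/MySageMakerExecutionRole",
        AppSpecification={
            "ImageUri": "<account>.dkr.ecr.<region>.amazonaws.com/classifier:latest",
            "ContainerEntrypoint": ["python3", "/opt/program/predict.py"],
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
    )
```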
I agree with Olivier. SageMaker notebook execution might not be the right tool for the job.
Papermill is a framework for running Jupyter notebooks in this fashion.
You can consider trying this; it allows you to deploy your Jupyter notebook directly as a serverless cloud function and uses Papermill behind the scenes.
Disclaimer: I work for Clouderizer.
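For context, Papermill's core API mentioned above is a single call; a minimal sketch, where the notebook names and parameter are made up (writing the output notebook directly to S3 requires the papermill S3 extras):

```python
import papermill as pm

# Execute the notebook with injected parameters; the output notebook keeps
# all cell outputs, so every run can be inspected afterwards.
pm.execute_notebook(
    "classifier.ipynb",
    "s3://my-bucket/runs/classifier-output.ipynb",
    parameters={"input_key": "uploads/new_data.csv"},
)
```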
It is totally possible, not an anti-pattern at all; it really depends on your use case. AWS actually made a great article describing it, which includes a Lambda.

AWS Batch vs AWS CodeBuild

I am very new to AWS, and when I was searching for something to download code from GitHub (a Python project), run it, and save the output to S3, the first service I found was CodeBuild.
So I implemented this kind of workflow using CodeBuild.
But now I have seen that AWS has a service called AWS Batch, and I am wondering if I should migrate my architecture to AWS Batch.
Can you explain which one - AWS CodeBuild or AWS Batch - is more suitable for my case? When should I use AWS Batch instead of AWS CodeBuild?
Thank you very much.
TL;DR: AWS CodeBuild is the nicer choice for simple jobs.
My (reverse) experience...
I needed to run a simple job that pulls data from an external API, reads/writes to an external database, and generates a CSV report.
The job takes ~1 hour to run, so AWS Lambda is out of the picture.
After some googling, I found AWS Batch and decided to give Creating a Simple “Fetch & Run” AWS Batch Job a try.
The required steps to get this "simple" job working:
Build a Docker image with the fetch & run script
Create an Amazon ECR repository for the image
Push the built image to ECR
Create a simple job script and upload it to S3
Create an IAM role to be used by jobs to access S3
Configure a compute environment
Create a job queue
Create a job definition that uses the built image
Submit and run a job that executes the job script from S3 (a submit_job sketch follows this list)
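That last submission step, for reference, is a single boto3 call; a rough sketch, where the queue and job definition names are placeholders and the environment variables follow the fetch-and-run blog's convention (adjust them to whatever your entrypoint script expects):

```python
import boto3

batch = boto3.client("batch")

# Submit the fetch-and-run job; the container's entrypoint script is expected
# to download the job script from S3 and execute it.
resp = batch.submit_job(
    jobName="fetch-and-run-example",
    jobQueue="my-job-queue",
    jobDefinition="fetch_and_run",
    containerOverrides={
        "environment": [
            {"name": "BATCH_FILE_TYPE", "value": "script"},
            {"name": "BATCH_FILE_S3_URL", "value": "s3://my-bucket/myjob.sh"},
        ],
    },
)
print(resp["jobId"])
```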
After spending the time to create all these resources, it did not work out of the box. I found myself debugging random things I shouldn't have had to debug, such as:
Dockerfile
entrypoint script
ECS cluster
EC2 instance and autoscaling group
After failing to find simple practical examples, and realizing the amount of effort required, I decided to explore other solutions.
I stumbled onto Using AWS CodeBuild to execute administrative tasks and this post.
I've used AWS CodeBuild in the past for a CI/CD pipeline and thought, "what the hell, let's give it a try". In a shorter amount of time, and with less effort, I was able to get a "CodeBuild job" running on a CloudWatch scheduler, with CodeBuild Slack notifications added:
Connect build project to your source code
Select a runtime environment
Create IAM role
Create a buildspec.yml and add runtime commands
One major advantage is that CodeBuild runs tasks in a full-blown Linux environment.
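For what it's worth, the scheduling piece can also be wired up with a couple of boto3 calls, since CloudWatch Events/EventBridge supports CodeBuild projects as targets; the rule name, project ARN and role below are placeholders:

```python
import boto3

events = boto3.client("events")

# Trigger the CodeBuild project every day at 02:00 UTC.
events.put_rule(
    Name="nightly-report-build",
    ScheduleExpression="cron(0 2 * * ? *)",
)
events.put_targets(
    Rule="nightly-report-build",
    Targets=[{
        "Id": "codebuild-target",
        "Arn": "arn:aws:codebuild:<region>:<account>:project/my-report-job",
        "RoleArn": "arn:aws:iam::<account>:role/EventBridgeCodeBuildRole",  # needs codebuild:StartBuild
    }],
)
```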
Drawbacks:
Max execution time of 8 hours
AWS CodeBuild was much easier to get working for my simple job.
Sorry for the long post, just wanted to share my experience with these 2 services.
AWS Batch is used for highly parallel computations, e.g., processing a large number of images at the same time:
AWS Batch enables you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources, and AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software.
Thus it's not suited for what you are trying to use it for. CodeBuild is the better choice, based on your description.

AWS-Batch vs EC2 vs AWS Workspaces for running batch scripts to load data to Redshift

I have multiple CSV files containing data for different tables, with file sizes varying from 1 MB to 1.5 GB. I want to process the data (replace/remove values of columns) row by row and then load the data into existing Redshift tables. This is once-a-day batch processing.
AWS Lambda:
Lambda has memory limitations, hence I was not able to run the process for large CSV files.
EC2: I already have an EC2 instance where I am running a Python script to transform and load the data to Redshift.
I have to keep the EC2 instance running all the time, with all the Python scripts I want to run for all tables and the environment set up (Python installed, the psycopg library, etc.), which leads to more cost.
AWS Batch:
I created a container image which has all the setup to run the python scripts, and pushed it to ECR.
I then set up AWS Batch job, which can take this container image and run it through ECS.
This is more optimized; I only pay for the EC2 capacity used and for ECR image storage.
But I will have to do all the development and unit testing on my personal desktop and then push a container image; there is no inline AWS service to test with.
AWS Workspaces:
I am not very familiar with AWS WorkSpaces, but I need input: can this also be used like AWS Batch, to start and stop an instance when required, run Python scripts on it, and edit or test the scripts?
Also, can I schedule it to run every day at a defined time?
I need input on which service is the most suitable, optimized solution for this use case. It would also be great if anyone could suggest a better way to use the services I mentioned above.
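For context, whichever service ends up running the container, the load step itself can stay a plain psycopg2 COPY from S3 after the row-by-row transform; a rough sketch, with the connection details and IAM role as placeholders:

```python
import psycopg2  # already used in the EC2 setup described above

def load_cleaned_csv(table, cleaned_s3_path):
    """Load an already-transformed CSV from S3 into an existing Redshift table."""
    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="analytics", user="loader", password="<password>",
    )
    with conn, conn.cursor() as cur:
        # A single COPY from S3 is far faster than row-by-row INSERTs,
        # especially for the 1.5 GB files.
        cur.execute(
            f"COPY {table} "
            f"FROM '{cleaned_s3_path}' "
            "IAM_ROLE 'arn:aws:iam::<account>:role/RedshiftCopyRole' "
            "CSV IGNOREHEADER 1"
        )
    conn.close()
```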
Batch is best suited for your use case. I see that your concern about Batch is the development and unit testing on your personal desktop. You can automate that process using AWS ECR, CodePipeline, CodeCommit and CodeBuild. Set up a pipeline to detect changes made to your code repo, build the image, and push it to ECR. Batch can pick up the latest image from there.