I have been using SageMaker recently and am running inference on GPU-based instances.
I am thinking of turning off the SageMaker inference instances at night, for example from 8 pm to 8 am.
I want to do that using CDK. Not sure if it is a crazy idea or not?
Any help?
Amazon SageMaker supports different inference options that fit various use cases. You can use SageMaker Asynchronous Inference endpoints to save cost during idle time (outside operational hours); you don't have to use AWS CDK/AWS CloudFormation for this option.
Amazon SageMaker supports automatic scaling (autoscaling) for your asynchronous endpoint. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Unlike other hosted models Amazon SageMaker supports, with Asynchronous Inference you can also scale your asynchronous endpoint's instances down to zero. Requests received while there are zero instances are queued and processed once the endpoint scales back up.
Refer to the documentation, samples, and blogs here.
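For illustration, here is a minimal sketch, assuming a hypothetical endpoint name and variant, of how the scale-to-zero behaviour can be configured with boto3 and Application Auto Scaling; the backlog-per-instance metric is the pattern the async inference docs describe, and the numbers should be tuned to your workload:

import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/my-async-endpoint/variant/AllTraffic"  # hypothetical names

# Allow the endpoint to scale all the way down to zero instances.
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=2,
)

# Scale based on how many queued requests each instance has to work through.
client.put_scaling_policy(
    PolicyName="async-backlog-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 300,
    },
)

With a setup like this the endpoint drops to zero instances on its own once the overnight backlog is empty, so no scheduled shutdown via CDK is strictly needed.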
Related
What is the difference between AWS Batch and a SageMaker Training Job when using them to run a Docker image for machine learning training?
Both services are implementations of CaaS, aka Container-as-a-Service. That means you don't have to manage clusters and only define a launch configuration. Both services can be used for running training jobs in this regard once you have your Docker image ready. Notable differences are:
[Operational complexity] AWS Batch has higher operational complexity than SageMaker training jobs. With the latter you don't need to provision any infrastructure - at most the role that is generated automatically. With the former you would need to deploy infrastructure, although you would definitely have more refined control over it.
[Architecture] AWS Batch is less pure CaaS and closer to a managed cluster. It has a job queue and scales the cluster based on queue size while also placing jobs on the machines. A SageMaker training job starts a VM per job, and the VM itself is abstracted from the user. So, for example, you can SSH into an AWS Batch instance, but not a SageMaker one.
[Docker image] SageMaker requires heavier customization of the Docker container to make it work, but in return you don't have to implement things like passing hyperparameters, gathering metrics, and saving the model yourself (see the sketch after this list). AWS Batch just runs the container, so any associated business logic has to be implemented by the developer.
[Cost] Both AWS Batch and SageMaker training jobs are free in themselves, i.e. you only pay for the underlying infrastructure that was used. SageMaker training jobs use ml.* instances, which are priced at a premium over their on-demand EC2 counterparts (e.g. p2.xlarge costs $0.90 per hour while ml.p2.xlarge costs $1.125 per hour, about 25% more). Both services have a way of running spot instances, which lowers the cost.
So, to summarize: AWS Batch is a more generalized and customizable tool, while SageMaker training jobs are a more focused one with more prebuilt features.
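To make the Docker image point concrete, here is a minimal sketch of the container contract a SageMaker training job expects; the paths follow SageMaker's documented /opt/ml layout, and the training logic itself is a placeholder:

import json
import os

# SageMaker mounts inputs and outputs under /opt/ml inside the training container.
PREFIX = "/opt/ml"
HYPERPARAMS_PATH = os.path.join(PREFIX, "input/config/hyperparameters.json")
TRAIN_CHANNEL = os.path.join(PREFIX, "input/data/train")
MODEL_DIR = os.path.join(PREFIX, "model")

def train():
    # Hyperparameters arrive as a JSON dict of strings.
    with open(HYPERPARAMS_PATH) as f:
        hyperparams = json.load(f)
    epochs = int(hyperparams.get("epochs", "10"))

    # ... load data from TRAIN_CHANNEL and fit your model here (placeholder) ...

    # Anything written to /opt/ml/model is packaged and uploaded to S3 as model.tar.gz.
    os.makedirs(MODEL_DIR, exist_ok=True)
    with open(os.path.join(MODEL_DIR, "model.txt"), "w") as f:
        f.write("trained for %d epochs" % epochs)

if __name__ == "__main__":
    train()

AWS Batch imposes none of these conventions; the container can read and write wherever your own code decides, which is exactly the extra business logic mentioned above.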
I'm running an application set up as follows:
Trained an ML model in TensorFlow
Created an API using FastAPI and wrapped it around the ML model for inference (a minimal sketch of this layer follows the list below)
Created a Dockerfile and containerized the whole application
Pushed the image to ECR
Created an EKS environment (2 nodes: 1 GPU, 1 CPU)
Deployed to EKS as a Kubernetes pod running the above image as a container inside the pod
Enabled HPA (Horizontal Pod Autoscaling) to achieve scaling
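For context, here is a minimal sketch of what that FastAPI layer typically looks like; the model path, input schema, and route are hypothetical placeholders, not the actual code:

from typing import List

import tensorflow as tf
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Hypothetical path; in practice the model is baked into or mounted in the image.
model = tf.keras.models.load_model("/models/my_model")

class PredictRequest(BaseModel):
    features: List[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Single-item batch; TensorFlow uses the GPU automatically if one is visible to the container.
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}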
We're able to achieve roughly 15 QPS (queries per second) with the above architecture, but we need to scale it to 50 for a client.
One approach is to add more nodes and scale the application using AWS node autoscaling.
I was also looking into AWS Lambda functions; although intriguing, I can't really use them since one of my pods (the API endpoint) needs a GPU for fast inference.
I was wondering if there is a better approach to dealing with this.
Is there any way to apply an autoscaling configuration to AWS Lambda provisioned concurrency using terraform?
I want to scale it up during peak hours, and ideally maintain an N+1 hot concurrency rate.
I looked here but found no reference to Lambdas: https://www.terraform.io/docs/providers/aws/r/appautoscaling_policy.html
The feature to control the autoscaling of Lambda provisioned concurrency was added in December 2019 (see this blog). As long as this is not available through Terraform, you have a couple of options to work around it:
Use a Terraform provisioner to set up the scaling rules through the AWS CLI. Instructions on which commands to run can be found in the AWS docs (a boto3 sketch equivalent to those commands follows this list).
Invoke the Lambda yourself from time to time to keep it warm; see e.g. this post or this Stack Overflow question.
Use a different service that provides more control, like ECS.
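As a rough illustration of the first option, this is what the registration looks like via boto3; the function name, alias, and capacities are hypothetical, and the CLI commands in the AWS docs do the same thing:

import boto3

client = boto3.client("application-autoscaling")
resource_id = "function:my-function:live"  # an alias or version, not $LATEST

client.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=1,   # the hot instance(s) you want to keep at all times
    MaxCapacity=10,
)

client.put_scaling_policy(
    PolicyName="provisioned-concurrency-tracking",
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Scale out once provisioned-concurrency utilization exceeds 70%.
        "TargetValue": 0.7,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        },
    },
)

Scheduled actions (put_scaling_policy's sibling, put_scheduled_action) can be layered on top of this to raise MinCapacity during the peak hours mentioned in the question.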
I am curious to know what the model.deploy command actually does in the background when run in an AWS SageMaker notebook,
for example:
predictor = sagemaker_model.deploy(initial_instance_count=9, instance_type='ml.c5.xlarge')
Also, what is happening in the background during SageMaker endpoint autoscaling? It is taking too long, almost 10 minutes, to launch new instances, by which time most of the requests get dropped or not processed, and I'm also getting connection timeouts while load testing through JMeter. Is there any way to get a faster boot-up, or something like a golden AMI, in SageMaker?
Are there any other means by which this issue can be solved?
The docs mention what the deploy method does: https://sagemaker.readthedocs.io/en/stable/model.html#sagemaker.model.Model.deploy
You could also take a look at the source code here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L377
Essentially the deploy method hosts your model on a SageMaker Endpoint, launching the number of instances using the instance type that you specify. You can then invoke your model using a predictor: https://sagemaker.readthedocs.io/en/stable/predictors.html
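For example, a hedged sketch of invoking the endpoint through the returned predictor; the payload format and serializers depend entirely on your model container, so these are placeholders:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# "predictor" is the object returned by sagemaker_model.deploy(...) above.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()
result = predictor.predict({"instances": [[1.0, 2.0, 3.0]]})
print(result)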
For autoscaling, you may want to consider lowering your threshold for scaling out so that the additional instances start to be launched earlier. This page offers some good advice on how to determine the RPS your endpoint can handle. Specifically, you may want to have a lower SAFETY_FACTOR to ensure new instances are provisioned in time to handle your expected traffic.
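As a sketch of what that looks like in practice with boto3 (the endpoint and variant names, measured max RPS, and cooldowns are all hypothetical; the target follows the guidance on that page, max RPS per instance * 60 * SAFETY_FACTOR):

import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

max_rps_per_instance = 20   # measured with a load test against a single instance
safety_factor = 0.5         # lower value means new instances start launching earlier
# SageMakerVariantInvocationsPerInstance is reported per minute, hence the * 60.
target_invocations = max_rps_per_instance * 60 * safety_factor

client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=9,
)

client.put_scaling_policy(
    PolicyName="invocations-per-instance-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": target_invocations,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)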
This really wasn't clear to me in the docs, and the console configuration is very confusing.
Will a Docker Cluster running in Fargate mode behind a Load Balancer shutdown and not charge me while it's not being used?
What about cold starts? Do I need to care about this in Fargate like in Lambda?
Is it less horizontal than Lambda? A lambda hooked to API Gateway will spawn a new function for every concurrent request, will Fargate do this too? Or will the load balancer decide it?
I've been running Flask/Django applications in Lambda for some time (Using Serverless/Zappa), are there any benefits in migrating them to Fargate?
It seems that Fargate is more expensive than Lambda, but if the Lambda limitations are not a problem, then Lambda should always be the better choice, right?
Will a Docker Cluster running in Fargate mode behind a Load Balancer shutdown and not charge me while it's not being used?
This will depend on how you configure the service's autoscaling (for Fargate, the ECS service's desired task count). If you allow it to scale down to 0, then yes.
What about cold starts? Do I need to care about this in Fargate like in Lambda?
Some good research has been done on this here: https://blog.cribl.io/2018/05/29/analyzing-aws-fargate/
The takeaway is that for smaller instances you shouldn't see much more than ~40 seconds to get to a running state; for bigger ones it will take longer.
Is it less horizontal than Lambda? A lambda hooked to API Gateway will spawn a new function for every concurrent request, will Fargate do this too? Or will the load balancer decide it?
ECS will not create a new instance for every concurrent request; any scaling will be done off the Auto Scaling group or the service's scaling policy. The load balancer doesn't have any control over scaling; it exclusively balances load. However, the metrics it exposes can be used to help determine whether scaling is needed.
I've been running Flask/Django applications in Lambda for some time (Using Serverless/Zappa), are there any benefits in migrating them to Fargate?
I haven't used Flask or Django, but the main reason people tend to migrate to serverless is to remove the need to manage the scaling of servers; this includes managing instance types, cluster scheduling, and optimizing cluster utilization.
#abdullahkhawer - I agree with his view on sticking to Lambdas. Unless you require something to be always running and always in use, 99% of the time Lambdas will be cheaper than running a VM.
For a pricing example
1 t2.medium on demand EC2 instance = ~$36/month
2 Million invocations of a 256MB 3 second running lambda = $0.42/month
With AWS Fargate, you pay only for the amount of vCPU and memory resources that your containerized application requests, from the time your container images are pulled until the AWS ECS task (running in Fargate mode) terminates; a minimum charge of 1 minute applies. So you pay for as long as your task (a group of containers) is running, more like AWS EC2 but on a per-minute basis, and unlike AWS Lambda where you pay per request/invocation.
AWS Fargate doesn't spawn containers on every request as AWS Lambda does. AWS Fargate works by simply running containers on a fleet of AWS EC2 instances managed internally by AWS.
AWS Fargate now supports running tasks on a scheduled basis and in response to AWS CloudWatch Events. This makes it easier to launch and stop container services that you only need to run at certain times, to save money.
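As a rough sketch of that scheduled approach via boto3 (the cluster, task definition, subnet, and role ARNs are hypothetical placeholders):

import boto3

events = boto3.client("events")

# Run every weekday at 08:00 UTC.
events.put_rule(
    Name="start-fargate-task-weekdays",
    ScheduleExpression="cron(0 8 ? * MON-FRI *)",
    State="ENABLED",
)

events.put_targets(
    Rule="start-fargate-task-weekdays",
    Targets=[
        {
            "Id": "fargate-task",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/my-task:1",
                "LaunchType": "FARGATE",
                "TaskCount": 1,
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
        }
    ],
)

A second rule with a different schedule (or a scheduled scale-down of the service) can stop the workload outside those hours.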
Keeping your use case in mind, if your applications are not causing any problems in production due to AWS Lambda limitations, then AWS Lambda is the better choice. If your AWS Lambda functions are being invoked very heavily (e.g., more than 1K concurrent invocations at any point in time) in production, then go for AWS EKS or AWS Fargate, as AWS Lambda might cost you more.