Sagemaker Endpoint throttling exception - endpoint

I have created an endpoint using Sagemaker, and designed my system so that it is called about 100 times simultaneously. This seemed to cause 'Model error' and take too much time. Do I need to create an endpoint for each event, and make one call per endpoint, instead?

you can go in cloudwatch logs to diagnose your model failure.
Real-time inference traffic scaling can be addressed via working on 3 independent dimensions:
hardware: choosing larger machines or more
machines. For example you can load test your model endpoint with bigger and bigger machines and see when hardware size gives you acceptable latency. The Autoscaling feature of SageMaker helps you address this automatically. If deploying a deep neural net, you can also consider using appropriate accelerators, eg GPU (EC2 P3, EC2 G4) or Amazon Elastic Inference Accelerator to make each prediction much faster.
software: you have 2 levers to tune here:
choosing a serving stack that is lean and fast. Different servers will handle load at different levels of performance. One common trick is to batch the load - for example, instead of hitting 100 times your server can you hit it only once with a batch of 100 records? If clients cannot batch their requests, can you use micro-asynchrony so that you do the batching yourself after they issued requests? You can usually configure such micro-batching in advanced deep learning servers such as TF Serving or MXNet Model Server (both can be used in SageMaker), but otherwise you can also do it yourself by having a queue (SQS) in front of your server.
model compilation - optimizing the model graph and its runtime. This is a very smart concept, that leverages the fact that when you know where you're going to deploy (eg NVIDIA, Intel, ARM, etc), you have an insider edge and you can refine your model artifact and create a bespoke runtime application that are tailor-made for this specific target platform. This can reduce memory consumption and latency by double-digit percentage, and is an active area of ML research. In the SageMaker ecosystem, such a compilation task can be performed with SageMaker Neo, but the open source ecosystem is developing fast, with notably treelite (paper, doc) for decision tree compilation and TVM (paper, doc) for arbitrary neural net compilation. Both are dependencies of Neo by the way.
science: some models are slower or heavier than others. If speed and concurrency are your priorities over accuracy, and if you already exploited all possible tricks at level (1) and (2) above, consider using fast-throughput models, eg linear models & logistic regression for structured data, MobileNet or SqueezeNet instead of large Resnets for classification (nice benchmark here), Yolov3 instead of FasterRCNN for detection (nice benchmark here), etc. But be aware that unlike levels (1) and (2), changing model science will alter accuracy.
As mentioned above, those 3 areas of improvements really are about real-time inference; if you can afford to pre-compute all possible model inputs, then the ultimate low-latency high-throughput solution is to pre-compute offline a variety of input-predictions pairs of interest and serve them on demand from a fast database or local read-only store.

Related

Difference between Compute optimized instances and Accelerated computing instances

I am new to AWS and came across these two instances but I don't understand the need for the difference between them they seem the same can Anyone explain this
Accelerated computing instances
Accelerated computing instances use hardware accelerators, or coprocessors, to perform some functions more efficiently than is possible in software running on CPUs. Examples of these functions include floating-point number calculations, graphics processing, and data pattern matching.
In computing, a hardware accelerator is a component that can expedite data processing. Accelerated computing instances are ideal for workloads such as graphics applications, game streaming, and application streaming.
Compute-optimized instances
Compute-optimized instances are ideal for compute-bound applications that benefit from high-performance processors. Like general-purpose instances, you can use compute-optimized instances for workloads such as web, application, and gaming servers.
However, the difference is computed optimized applications are ideal for high-performance web servers, compute-intensive applications servers, and dedicated gaming servers. You can also use compute-optimized instances for batch processing workloads that require processing many transactions in a single group.
The main difference between both of them is the hardware. Accelerated computing instances have GPUs which are accelerators whereas the compute optimized have good high performance processors. Though both of them provide high compute power, the way in which they are providing is different.
https://aws.amazon.com/ec2/instance-types/
Compute Optimized instances are the instance designed for the applications that inturn gives the benefit as high compute power. These applications includes compute-intensive applications e.g. high-performance web servers, high-performance computing (HPC), distributed analytics, scientific modelling and ML inference
To know more about Compute Optimized instances you can refer this
link.
If you require high processing capability, you'll benefit from using accelerated computing instances, which provide access to hardware-based compute accelerators such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), or AWS Inferentia.
To know more about Accelerated Computing you can refer this link.

how to use two aws ec2 instances(1 gpu and 1 cpu instance) with one storage to(run code, store/share files) & reduce cost

My team is using a gpu instance to run machine learning tensorflow based, yolo,computer vision applications and use it for training machine learning models also.. It costs 7$ an hour and has 8 gpu's. Was trying to reduce costs on it. We need 8 gpu's for faster training and sometimes many people can use different gpu's at the same time.
For our use case we are not using sometimes the gpu's(8 gpus) at all for atleast 1-2 weeks of a month. But a use of the gpu may arrive during that time but maynot also. So i wanted to know is there a way to edit the code and do all cpu intensive operations when gpu not needed through a low cost cpu instance. And turn on the gpu instance only when needed use it and then stop it when work done.
I thought of using efs for putting code on the shared file system and then running from there but i read an article( https://www.jeffgeerling.com/blog/2018/getting-best-performance-out-amazon-efs ) where its written that i should never run code from network based drives because the speed can become really slow. So i dont know if its good to run machine learning application from efs file system. I was thinking of making virtual environments on folders in efs but i dont think that is a good idea.
Could anyone suggest good ways of achieving this and reduce costs. And if you are suggesting to use an instance with lower number of gpu's that i have considered but we sometimes need 8 gpu's for faster training but we dont use the gpus at all for 1-2 weeks but the costs are still incurred.
Please suggest a way on how to achieve a low cost for this use case without using spot or reserved instances.
Thanks in advance
A few thoughts:
GPU instances now allow hibernation, so when launching your GPU select the new Stop Instance behavior 'hibernate' which will let you turn it off for 2 weeks but spin it up quickly if necessary
If you only have one instance, look into using EBS for data storage with a high volume of provisioned iops to move data on/off your instance quickly
Alternately, move your model to Sagemaker to ensure you are only charged for GPU use when you are actively training your model
If you are applying your model (inferencing) move that workload to a cheap instance. A trained yolo model can run inferencing on very small CPU instances, no need for a GPU for that part of the workload at all.
To reduce inference costs, you can use Elastic Inference which supports pay-per-use functionality:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-inference.html

Amazon EC2 Auto Scaling test

I created an Amazon EC2 Auto Scaling group, where it should have at least 1 server all the time.
Add up 2 servers when CPU utilization passes beyond 80%
Terminate 2 servers when CPU utilization comes down less then 30%
Challenge here is, How should I increase/decrease CPU utilization? I cannot connect to any instance or use CLI since I am in Office system / restricted AWS access.
Is there a way to test this despite of these restrictions?
There is a way to stress test an instance or container (assuming it is Linux based) using Stress, a package that is designed to crank up the CPU for a specified amount of time and then bring the CPU percentages down after the specified amount of time. It has other parameters to customize the testing.
My personal favorite tool for testing system response and DR is to use Netflix's ChaosMonkey. It is an open source project, backed by Netflix that is designed to test fault tolerance. Using it in production comes down to personal preference, but it is a tool for testing systems.
If you want to test the "real" situation, then you will need a way to generate load on the system.
This could be artificial load (eg triggering a program that does calculations, just to spin the CPU) or a real-world simulation of actual activities that you system will perform.
There is no need to test whether Amazon EC2's Auto Scaling actually works — there would be issues shown on the AWS Status Page if that were the case — so I presume you just wish to test your own configuration. In this case, you should really be testing a real world scenario, such as simulating a quantity of simultaneous users doing typical activities that users would perform.
If you do any other form of testing (such as fake increasing of CPU load), you're not really testing the real situation in which you want Auto Scaling to perform, so the results of your test won't actually be useful.
For example, it might be that your application runs into memory issues or single-threading issues way before it hits any CPU limits. That would be something you'd really like to know before throwing real users at your system.

How to do or perform

Ho would you do the below
Please share your thoughts.
There are many ways to perform stress test on AWS architecture some of them are Jmeter, Blazemeter etc. Regarding restriction you have to let AWS support know before hand regarding the stress test or penetration test you are going to perform on the AWS infrastructure you have created. Check this link for more details.
https://aws.amazon.com/security/penetration-testing/
Since you pay for bytes into and out of the Amazon infrastructure, to maintain low costs keep your load generators in the same data center as your application under test. There are some drawbacks to this, but the primary one is that your network will lack the complexity of impairment that real users will experience. If you are using a tool which includes modeling of network impairment with the virtual users then this drawback is reduced.
No matter what tool you use, if you have the load generators running on virtual machines in AWS, you will face the issue of clock float on the virtual machine clock. Periodically this virtual clock will need to be resync'd to the system clock on the hypervisor host. This will result in clock jump. This will happen while you have a timing record open - it is unavoidable. The net impact of this is that you will have higher average, percentile values, standard deviation and maximums than if you were running on physical hardware.

Can I improve performance of my GCE small instance?

I'm using cloud VPS instances to host very small private game servers. On Amazon EC2, I get good performance on their micro instance (1 vCPU [single hyperthread on a 2.5GHz Intel Xeon], 1GB memory).
I want to use Google Compute Engine though, because I'm more comfortable with their UX and billing. I'm testing out their small instance (1 vCPU [single hyperthread on a 2.6GHz Intel Xeon], 1.7GB memory).
The issue is that even when I configure near-identical instances with the same game using the same settings, the AWS EC2 instances perform much better than the GCE ones. To give you an idea, while the game isn't Minecraft I'll use that as an example. On the AWS EC2 instances, succeeding world chunks would load perfectly fine as players approach the edge of a chunk. On the GCE instances, even on more powerful machine types, chunks fail to load after players travel a certain distance; and they must disconnect from and re-login to the server to continue playing.
I can provide more information if necessary, but I'm not sure what is relevant. Any advice would be appreciated.
Diagnostic protocols to evaluate this scenario may be more complex than you want to deal with. My first thought is that this shared core machine type might have some limitations in consistency. Here are a couple of strategies:
1) Try backing into the smaller instance. Since you only pay for 10 minutes, you could see if the performance is better on higher level machines. If you have consistent performance problems no matter what the size of the box, then I'm guessing it's something to do with the nature of your application and the nature of their virtualization technology.
2) Try measuring the consistency of the performance. I get that it is unacceptable, but is it unacceptable based on how long it's been running? The nature of the workload? Time of day? If the performance is sometimes good, but sometimes bad, then it's probably once again related to the type of your work load and their virtualization strategy.
Something Amazon is famous for is consistency. They work very had to manage the consistency of the performance. it shouldn't spike up or down.
My best guess here without all the details is you are using a very small disk. GCE throttles disk performance based on the size. You have two options ... attach a larger disk or use PD-SSD.
See here for details on GCE Disk Performance - https://cloud.google.com/compute/docs/disks
Please post back if this helps.
Anthony F. Voellm (aka Tony the #p3rfguy)
Google Cloud Performance Team