I have been given the following business logic:

1. A customer makes a request for services through a third-party gateway GUI to an EC2 instance.
2. Processing for some time (around 15 hours).
3. Data retrieval.
Currently this is implemented by statically assigning each user an EC2 instance to handle their requests. (This instance actually creates some sub-instances to process the data in parallel.)
What should happen is that an EC2 instance is fired up automatically for each request.
In the long term I was thinking this should be done using SWF (given the use of subprocesses); however, I wondered whether, as a quick and dirty solution, Auto Scaling with the correct settings is worth pursuing.
Any thoughts?
you can "trick" autoscalling to spin up instances based on metrics:
http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/policy_creating.html
So on each request, increment a metric; decrement it when the process completes. Drive the Auto Scaling group from that metric.
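A rough sketch of that bookkeeping in Python with boto3 (the namespace, metric name, and the counter-persistence helper are placeholders you would adapt to your app):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_job_count_change(delta):
    """Call with delta=+1 when a request arrives and delta=-1 when it completes."""
    # CloudWatch stores absolute values rather than increments, so keep the
    # running total yourself (e.g. in a database) and publish the current total.
    active_jobs = update_job_counter(delta)  # hypothetical persistence helper
    cloudwatch.put_metric_data(
        Namespace="MyApp",  # placeholder namespace
        MetricData=[{
            "MetricName": "ActiveJobs",
            "Value": active_jobs,
            "Unit": "Count",
        }],
    )
```

An Auto Scaling policy can then scale the group on ActiveJobs rather than a built-in metric.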
Use Step Adjustments to control the number of instances: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-scale-based-on-demand.html#as-scaling-types
Interesting challenge: binding customers to specific EC2 instances. Do you have a hard requirement to give each customer their own instance? It sounds like autoscaling is actually better suited to the parallel processing of the actual data than to request routing. You may get away with a fixed number of machines for this and/or scaling based on traffic rather than the number of customers.
While I have worked with AWS for a bit, I'm stuck on how to correctly approach the following use case.
We want to design an uptime monitor for up to 10K websites.
The monitor should run from multiple AWS regions, ping websites to check whether they are available, and measure the response time. With a Lambda function, I can ping the site, pass the result to an SQS queue, and process it. So far, so good.
However, I want to run this function every minute. I also want the ability to add and delete monitors: if I no longer want to monitor website "A" from region "us-west-1", I would like to be able to remove it, or, the other way round, add a website to a region.
Ideally, all of this would run serverless and be deployable to custom regions with CloudFormation.
What services should I go with?
I have been thinking about EventBridge, where I would create custom events for every website in every region and then send the result over SNS to a central processing Lambda. But I'm not sure this is the way to go.
Alternatively, I wanted to build a scheduler Lambda that fetches the websites it has to schedule from a DB and then invokes the fetcher Lambda, but I was not sure about the delay, since I want the functions triggered every minute. The architecture should monitor 10K websites, and even more if possible.
Feel free to give me any advice you have :)
Kind regards.
In my opinion Lambda is not the correct solution for this problem. Your costs will be very high and it may not scale to what you want to ultimately do.
A c5.9xlarge EC2 instance costs about USD $1.53/hour and has a 10 Gbit network. With 36 CPUs, a threaded program could take care of a large percentage, maybe all 10K, of your load. It could still be run in multiple regions on demand and push to an SQS queue. That's around $1,100/month/region without pre-purchasing EC2 time.
A Lambda running 10,000 times per minute, for 5 seconds each time, and using only 128 MB, would cost around USD $4,600/month/region.
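For reference, a rough sketch of the arithmetic behind those two figures (using the list prices quoted above; actual prices vary by region and change over time):

```python
HOURS_PER_MONTH = 730

# EC2: one c5.9xlarge per region, on-demand
ec2_monthly = 1.53 * HOURS_PER_MONTH             # ~ $1,117 /month/region

# Lambda: 10,000 invocations/minute, 5 s each, 128 MB memory
invocations = 10_000 * 60 * 24 * 30              # ~ 432M invocations/month
request_cost = invocations / 1_000_000 * 0.20    # $0.20 per 1M requests
gb_seconds = invocations * 5 * (128 / 1024)      # duration x memory
compute_cost = gb_seconds * 0.0000166667         # price per GB-second
lambda_monthly = request_cost + compute_cost     # ~ $4,586 /month/region

print(f"EC2 ~${ec2_monthly:,.0f}/month, Lambda ~${lambda_monthly:,.0f}/month")
```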
Coupled with the management interface you're alluding to, the EC2 instance could handle pretty much everything you want to do. Of course, you'd want to scale and likely have at least two EC2 instances for failover, but with two of them you're still at less than half the cost of the Lambda. If you later scale to 100,000 websites, it's just a matter of adding machines.
There are a ton of other choices, but understand that serverless does not mean cost-efficient in all use cases.
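As a minimal sketch of what the threaded checker on such an EC2 could look like (Python with boto3; the queue URL and site list are placeholders, and in practice the site list would come from your management interface):

```python
import concurrent.futures
import json
import time
import urllib.request

import boto3

# Placeholders: the queue URL and site list would come from configuration
# or the management interface mentioned above.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uptime-results"
SITES = ["https://example.com", "https://example.org"]

sqs = boto3.client("sqs")

def check(url):
    """Fetch a URL and return its HTTP status and response time in milliseconds."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except Exception:
        status = 0  # unreachable or errored
    elapsed_ms = round((time.monotonic() - start) * 1000, 1)
    return {"url": url, "status": status, "response_ms": elapsed_ms}

def run_once():
    # A thread pool lets one instance check thousands of sites per minute;
    # tune max_workers to the instance's CPU count and network capacity.
    with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
        for result in pool.map(check, SITES):
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(result))

if __name__ == "__main__":
    while True:
        started = time.monotonic()
        run_once()
        time.sleep(max(0.0, 60 - (time.monotonic() - started)))  # one sweep per minute
```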
We are trying to implement an elastically scaling application on AWS, but currently, due to the complexity of the application's processing, I have an issue with the routing algorithm.
In the application, when we receive a request (a request for a complex calculation), we immediately send a token to the user and start calculating, so the user can return with the token any day and access the calculated results. When there are more calculation requests, they wait in a queue and get executed two at a time, as one calculation takes a considerable amount of CPU. As you can see, in this specific scenario:
The application's active connection count is very low, as we respond to the user with the token as soon as we get the request.
CPU usage looks normal, as we run calculations two at a time.
Considering these facts, with load-balancer-based routing we face the problem of elastic instances terminating before the full queue has finished calculating, and the queue grows really long because the load balancer has no idea about the queued requests.
To solve it, we either need to do the routing manually or find a way to let the load balancer know the queued request count (maybe with an API call). If you have an idea of how to do this, please help me. (I'm new to AWS.)
Any idea is welcome.
Based on the comments.
An issue observed with the original approach was premature termination of instances, since their scale-in/out is based on CPU utilization only.
A proposed solution to rectify the issue bases the scaling activities on the length of the job queue. An example of such a solution is shown in the following AWS link:
Using Target Tracking with the Right Metric
In the example, the scaling is based on the following metric:
The solution is to use a backlog per instance metric with the target value being the acceptable backlog per instance to maintain.
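A rough sketch of publishing that metric with boto3 (the queue URL, Auto Scaling group name, and namespace are placeholders), run on a short schedule, e.g. from a scheduled Lambda:

```python
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/calc-jobs"  # placeholder
ASG_NAME = "calc-workers"                                                 # placeholder

def publish_backlog_per_instance():
    # Jobs currently waiting in the queue
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Instances currently in service in the Auto Scaling group
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    in_service = [i for i in group["Instances"] if i["LifecycleState"] == "InService"]

    # The metric the target tracking policy keeps at your acceptable value
    value = backlog / max(len(in_service), 1)
    cloudwatch.put_metric_data(
        Namespace="MyApp",  # placeholder
        MetricData=[{"MetricName": "BacklogPerInstance", "Value": value, "Unit": "Count"}],
    )
```

A target tracking policy on BacklogPerInstance then adds instances when the backlog per instance rises above the target and removes them as the queue drains, rather than reacting to CPU alone.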
I have created a model endpoint which is InService and deployed on an ml.m4.xlarge instance. I am also using API Gateway to create a RESTful API.
Questions:
1. Is it possible to have my model endpoint InService (or on standby) only when I receive inference requests? Maybe by writing a Lambda function or something that turns off the endpoint, so that it does not keep accumulating the per-hour charges.
2. If Q1 is possible, would this cause weird latency issues for end users? It usually takes a couple of minutes for model endpoints to be created when I configure them for the first time.
3. If Q1 is not possible, how would choosing a cheaper instance type affect the time it takes to perform inference? (Say I'm only using the endpoints for an application with a low number of users.)
I am aware of this site that compares different instance types (https://aws.amazon.com/sagemaker/pricing/instance-types/)
But does having "moderate" network performance mean that the time to perform real-time inference may be longer?
Any recommendations are much appreciated. The goal is not to burn money when users are not requesting predictions.
How large is your model? If it is under the 50 MB deployment package size limit imposed by AWS Lambda and the dependencies are small enough, there could be a way to rely directly on Lambda as an execution engine.
If your model is larger than 50 MB, there might still be a way to run it by storing it on EFS. See EFS for Lambda.
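As a rough sketch of that idea, assuming a scikit-learn model serialized with joblib, an EFS access point mounted on the function at /mnt/model, and the joblib/scikit-learn dependencies packaged in a layer or container image:

```python
import json

import joblib  # assumes joblib (and scikit-learn) are available in a layer or image

# Loaded once per execution environment, outside the handler, so warm
# invocations reuse the model instead of re-reading it from EFS.
MODEL_PATH = "/mnt/model/model.joblib"  # placeholder: the EFS mount path on the function
model = joblib.load(MODEL_PATH)

def handler(event, context):
    # Assumes API Gateway proxy integration with a JSON body like {"features": [[...], ...]}
    body = json.loads(event.get("body") or "{}")
    predictions = model.predict(body["features"]).tolist()
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": predictions}),
    }
```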
If you're willing to wait 5-10 minutes for SageMaker to launch, you can accomplish this by doing the following:
1. Set up a Lambda function (or create a method in an existing function) to check your endpoint status when the API is called. If the status is not 'InService', call the function in #2.
2. Create another method that, when called, launches your endpoint and creates a metric alarm in CloudWatch to monitor your primary Lambda function's invocations. When invocations fall below your desired threshold per period, the alarm will call the function in #3.
3. Create a third method to delete your endpoint and the alarm when called. Technically, the alarm can't call a Lambda function directly, so you'll need to create a topic in SNS and subscribe this function to it.
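A rough sketch of those three methods with boto3 (the endpoint, endpoint config, function, topic, and alarm names are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

sagemaker = boto3.client("sagemaker")
cloudwatch = boto3.client("cloudwatch")

# Placeholders: adapt to your endpoint, endpoint config, API Lambda, topic, and alarm
ENDPOINT_NAME = "my-endpoint"
ENDPOINT_CONFIG_NAME = "my-endpoint-config"
API_FUNCTION_NAME = "inference-api"
ALARM_NAME = "endpoint-idle"
IDLE_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:endpoint-idle"

def endpoint_in_service():
    """Method 1: check whether the endpoint exists and is InService."""
    try:
        status = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)["EndpointStatus"]
        return status == "InService"
    except ClientError:
        return False  # endpoint does not exist (it was deleted while idle)

def launch_endpoint():
    """Method 2: create the endpoint and an idle alarm on the API Lambda's invocations."""
    sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME, EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    cloudwatch.put_metric_alarm(
        AlarmName=ALARM_NAME,
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": API_FUNCTION_NAME}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=6,            # e.g. ~30 minutes of low traffic
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",   # no invocations at all also counts as idle
        AlarmActions=[IDLE_TOPIC_ARN],  # SNS topic that method 3 is subscribed to
    )

def teardown_endpoint(event, context):
    """Method 3: subscribed to the SNS topic; deletes the endpoint and the alarm."""
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    cloudwatch.delete_alarms(AlarmNames=[ALARM_NAME])
```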
Good luck!
I have 2 instances connected to a load balancer. I would like to stop one instance and start it only when a certain alarm fires, for example when the first instance has high CPU load.
I couldn't find how to do this. In the Auto Scaling group, I see I can launch a brand new instance, but that's not what I want; I want a specific instance to start.
I also couldn't find how to connect an alarm to an action that wakes up this specific instance.
Should this be done in the load balancer configuration? I couldn't find how.
This is not really the way autoscaling is supposed to work, and hence the solution to your particular problem is a little bit more complex than simply using autoscaling to create new instances in response to metric thresholds being crossed. It may be worth asking yourself exactly why you need to do it this way and whether it could be achieved in the usual way.
In order to start (and stop) a particular instance, you'll need three pieces:
A CloudWatch Alarm triggered by the metric you need (CPUUtilization) crossing your desired threshold.
An SNS topic that is triggered by the alarm in the previous step.
A Lambda function (with the correct IAM permissions) subscribed to the SNS topic, which makes the relevant EC2 API calls to start or stop the instances when the notification from SNS arrives. You can find examples of the code needed to do this in Node.js and from AWS, although there are probably others if you prefer another language.
Once you put all of these together, you should be able to respond to a change in CPU by starting and stopping particular instances.
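A minimal sketch of the Lambda piece (Python with boto3; the instance ID is a placeholder, and it assumes the alarm's ALARM and OK transitions both publish to the same SNS topic):

```python
import json

import boto3

ec2 = boto3.client("ec2")

STANDBY_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: the specific instance to wake up

def handler(event, context):
    """Subscribed to the SNS topic; starts or stops the standby instance
    depending on which state the CPU alarm transitioned into."""
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])  # CloudWatch alarm notification payload
        if alarm["NewStateValue"] == "ALARM":
            ec2.start_instances(InstanceIds=[STANDBY_INSTANCE_ID])
        elif alarm["NewStateValue"] == "OK":
            ec2.stop_instances(InstanceIds=[STANDBY_INSTANCE_ID])
```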
We host a sale every month. Once we are ready with all the deals data, we send a notification to all of our users. As a result, we get huge traffic within seconds, and it lasts for about an hour. Currently we change the instance class to F4_1G before the sale starts and back to F1 after one hour. Is there a better way to handle this?
Apart from changing the instance class of App Engine Standard based on the demand you expect, you can (and should) also consider a good scaling approach for your application. App Engine Standard offers three different scaling types, which are documented in detail, but let me summarize their main features here:
Automatic scaling: based on request rate, latency in the responses and other application metrics. This is probably the best option for the use case you present, as more instances will be spun up in response to demand.
Manual scaling: continuously running, instances preserve state and you can configure the exact number of instances you want running. This can be useful if you already know how to handle your demand from previous occurrences of the spikes in usage.
Basic scaling: the number of instances scales with the volume of demand, and you can set the maximum number of instances that can be serving.
Given the use case you presented in your question, I think automatic scaling is the scaling type that best matches your requirements, so let me go a little more in depth on the parameters you can tune when using it:
Concurrent requests to be handled by each instance: set up the maximum number of concurrent requests that can be accepted by an instance before spinning up a new instance.
Idle instances available: how many idle (not serving traffic) instances should be ready to handle traffic. You can tune this parameter to be higher when you have the traffic spike, so that requests are handled in a short time without having to wait for an instance to be spun up. After the peak you can set it to a lower value to reduce the costs of the deployment.
Latency in the responses: the allowed time that a request can be left in the queue (when no instance can handle it) without starting a new instance.
If you play around with these configuration parameters, you can control quite deterministically the number of instances you have, accommodating the big spikes and later returning to lower values to decrease usage and cost.
One additional note when using automatic scaling: after a traffic spike, you may see more idle instances than you specified (they are not torn down immediately, to avoid having to start new instances), but you will only be billed for up to the max_idle_instances that you specified.