AWS API Gateway + Lambda - how to handle 1 million requests per second - amazon-web-services

We would like to create a serverless architecture for our startup that supports up to 1 million requests per second and 50 million active users. How can we handle this use case with an AWS architecture?
According to the AWS documentation, API Gateway can handle only 10K requests/s and Lambda can process 1K invocations/s, which is unacceptable for us.
How can we overcome this limitation? Can we request this throughput from AWS support, or can we somehow connect to other AWS services (queues)?
Thanks!

Those numbers you quoted are the default account limits. Lambda and API Gateway can handle more than that, but you have to send a request to Amazon to raise your account limits. If you are truly going to receive 1 million API requests per second then you should discuss it with an AWS account rep. Are you sure most of those requests won't be handled by a cache like CloudFront?
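Before talking to AWS about a limit increase, it can help to estimate how many concurrent Lambda executions the target load implies. The sketch below applies Little's law (concurrency is roughly request rate times average duration). The 100 ms duration and the 90% cache hit ratio are illustrative assumptions, not figures from the question, and `required_concurrency` is a hypothetical helper name.

```python
def required_concurrency(requests_per_second: float,
                         avg_duration_seconds: float,
                         cache_hit_ratio: float = 0.0) -> float:
    """Estimate concurrent Lambda executions using Little's law.

    concurrency ~= (requests reaching Lambda per second) * (average duration)
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    # Requests answered by a cache never reach Lambda.
    origin_rps = requests_per_second * (1.0 - cache_hit_ratio)
    return origin_rps * avg_duration_seconds


# 1M rps comes from the question; 100 ms per invocation is an assumption.
no_cache = required_concurrency(1_000_000, 0.1)
# Assume 90% of requests are served by a cache such as CloudFront.
with_cache = required_concurrency(1_000_000, 0.1, cache_hit_ratio=0.9)
print(round(no_cache), round(with_cache))
```

Under these assumptions, a high cache hit ratio cuts the concurrency you need to request by an order of magnitude, which is why the caching question matters.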

The gateway is NOT your API server. Lambdas are the bottleneck.
While the gateway can handle 100,000 messages/sec (because it is going through a message queue), Lambdas top out at around 2,200 rps even with scaling (https://amido.com/blog/azure-functions-vs-aws-lambda-vs-google-cloud-functions-javascript-scaling-face-off/)
This differs dramatically from actual API framework implementations, where the scale goes up to 3,500+ rps...

I think you should go with Application Load Balancer.
It is limitless in terms of RPS and can potentially be even cheaper for a large number of requests. It has fewer integrations with AWS services, but in general it has everything you need for a gateway.
https://dashbird.io/blog/aws-api-gateway-vs-application-load-balancer/

Related

Multi region API/Lambda Architecture latency issue

We are trying to deploy our API Gateway/Lambda and route it through Route53 in the following regions.
ap-south-1
    Lambda
    API Gateway + Certificate for API Gateway + Custom Domain
us-east-1
    Lambda
    API Gateway + Certificate for API Gateway + Custom Domain
    DynamoDB
    AWS Elasticsearch Service
Our Lambdas (ap-south-1, us-east-1) connect to DynamoDB (us-east-1) and AWS Elasticsearch Service (us-east-1) to fetch data.
When we test the Lambda in us-east-1, it has 200 ms of execution time.
But when we test the Lambda in ap-south-1, it has around 3 seconds of execution time.
The logic is the same in both Lambdas. The only difference is that the ap-south-1 Lambda requests DynamoDB/Elasticsearch Service in us-east-1.
We want to understand why it takes around 3 seconds when the Lambda is executed from ap-south-1, since it is an inter-region request that stays within the AWS network infrastructure.
What you are observing is a typical latency issue, since the data store is too far from the application.
Also, your architecture is not truly multi-region. Even though you are in two regions, your application is unusable if us-east-1 goes down.
You should:
Enable replication of DynamoDB tables.
Make each Lambda/application hit only regional services, with no cross-region calls.
Replicate Elasticsearch using DynamoDB Streams.
If Lambda is using SNS and SQS, hook them up using DynamoDB Streams as well.
This will make sure that:
You have low-latency reads.
There are no issues if there is a regional outage.
But it will have drawbacks such as:
Cost will be higher.
If writes are allowed from both regions, race conditions may occur.
As others have already said, it's probably a latency issue.
If you make multiple synchronous requests to a different region, these latencies add up.
To investigate further, you can try AWS X-Ray. It may give you some details on where the latency builds up.
https://aws.amazon.com/it/xray/
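To see how sequential cross-region calls can add up to seconds, here is a back-of-the-envelope sketch. The 200 ms round trip between ap-south-1 and us-east-1 and the count of 14 sequential calls are assumptions chosen for illustration, not measurements from the question. Only the 200 ms in-region execution time comes from the question. `estimated_execution_ms` is a hypothetical helper.

```python
def estimated_execution_ms(local_work_ms: int,
                           sequential_calls: int,
                           round_trip_ms: int) -> int:
    """Total time when each remote call waits for the previous one to finish."""
    if sequential_calls < 0:
        raise ValueError("sequential_calls must not be negative")
    return local_work_ms + sequential_calls * round_trip_ms


# 200 ms is the in-region execution time reported in the question.
# 14 calls and a 200 ms cross-region round trip are assumed values.
total = estimated_execution_ms(local_work_ms=200, sequential_calls=14, round_trip_ms=200)
print(total, "ms")
```

If the real function makes a dozen or so synchronous DynamoDB/Elasticsearch calls (connection setup counts too), a 3 second total is plausible. X-Ray traces would show the actual number of calls and the time spent in each.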

How can I add ip-based rate limits with longer intervals on API Gateway?

I have an API Gateway endpoint that I would like to limit access to. For anonymous users, I would like to set both daily and monthly limits (based on IP address).
AWS WAF has the ability to set rate limits, but the interval for them is a fixed 5 minutes, which is not useful in this situation.
API Gateway has the ability to add usage plans with longer term rate quotas that would suit my needs, but unfortunately they seem to be based on API keys, and I don't see a way to do it by IP.
Is there a way to accomplish what I'm trying to do using AWS Services?
Is it maybe possible to use a usage plan and automatically generate an API key for each user who wants to access the API? Or is there some other solution?
Without more context on your specific use-case, or the architecture of your system, it is difficult to give a “best practice” answer.
Like most things tech, there are a few ways you could accomplish this. One way would be to use a combination of CloudWatch API logging, Lambda, DynamoDB (with Streams) and WAF.
At a high level (and regardless of this specific need) I’d protect my API using WAF and the AWS security automations quickstart, found here, and associate it with my API Gateway as guided in the docs here. Once my WAF is set up and associated with my API Gateway, I’d enable CloudWatch API logging for API Gateway, as discussed here. Now that I have things set up, I’d create two Lambdas.
The first will parse the CloudWatch API logs and write the data I’m interested in (IP address and request time) to a DynamoDB table. To avoid unnecessary storage costs, I’d set the TTL on the record I’m writing to my DynamoDB table to be twice whatever my analysis’s temporal metric is, i.e., if I’m looking to limit it to 1000 requests per 1 month, I’d set the TTL on my DynamoDB record to be 2 months. From there, my CloudWatch API log group will have a subscription filter that sends log data to this Lambda, as described here.
My second Lambda is going to be doing the actual analysis and handling what happens when my metric is exceeded. This Lambda is going to be triggered by the write event to my DynamoDB table, as described here. I can have this Lambda run whatever analysis I want, but I’m going to assume that I want to limit access to 1000 requests per month for a given IP. When the new DynamoDB item triggers my Lambda, the Lambda is going to query the DynamoDB table for all records that were created in the preceding month from that moment, and that contain the IP address. If the number of records returned is less than or equal to 1000, it is going to do nothing. If it exceeds 1000 then the Lambda is going to update the WAF WebACL, and specifically UpdateIPSet to reject traffic for that IP, and that’s it. Pretty simple.
With the above process I have near real-time monitoring of requests to my API Gateway, in an efficient, cost-effective, scalable manner that can be deployed entirely serverless.
This is just one way to handle this. There are definitely other ways you could accomplish it, such as with Kinesis and Elasticsearch, by analyzing CloudTrail events instead of logs, or by using a third-party solution that integrates with AWS.
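The decision logic of the two Lambdas described above can be sketched in a few lines. This is only an illustration under stated assumptions: the DynamoDB query and the WAF UpdateIPSet call are deliberately left out, `records` is a hypothetical in-memory stand-in for the query result, and the function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple


def ttl_epoch_seconds(request_time: datetime, window: timedelta) -> int:
    """TTL for a record: twice the analysis window, as epoch seconds."""
    return int((request_time + 2 * window).timestamp())


def should_block(records: Iterable[Tuple[str, datetime]],
                 ip: str,
                 now: datetime,
                 window: timedelta,
                 limit: int) -> bool:
    """Return True when `ip` made more than `limit` requests in the window.

    `records` stands in for the items a DynamoDB query would return:
    (ip_address, request_time) pairs.
    """
    window_start = now - window
    count = sum(1 for rec_ip, ts in records
                if rec_ip == ip and window_start < ts <= now)
    return count > limit


# Example with made-up data: 1001 requests from one IP within 30 days.
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
records = [("203.0.113.7", now - timedelta(minutes=i)) for i in range(1001)]
if should_block(records, "203.0.113.7", now, timedelta(days=30), limit=1000):
    print("would add 203.0.113.7 to the WAF IP set")
```

In a real deployment the second Lambda would run a query against the table instead of scanning a list, and would call WAF to update the IP set when `should_block` returns True.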

Can/will AWS API Gateway -> Lambda performance be improved?

Has anyone found a solution to API Gateway latency issues?
With a simple function testing API Gateway -> Lambda interaction, I regularly see cold starts in the 2.5s range, and once "warmed," response times in the 900ms - 1.1s range are typical.
I understand the TLS handshake has its own overhead, but testing similar resources (AWS-based or general sites that I believe are not geo-distributed) from my location shows results that are half that, ~500ms.
Is good news coming soon from AWS?
(I've read everything I could find before posting.)
Engineer with the API Gateway team here.
You said you've read "everything", but for context for others I want to link to a number of threads on our forums where I've documented publicly where a lot of this perceived latency when executing a single API call comes from:
Forum Post 1
Forum Post 2
In general, as you increase your call rates, your average latency will shrink as connection reuse mechanisms between your clients and CloudFront as well as between CloudFront and API Gateway can be leveraged. Additionally, a higher call rate will ensure your Lambda is "warm" and ready to serve requests.
That being said, we are painfully aware that we are not meeting the performance bar for a lot of our customers and are making strides towards improving this:
The Lambda team is constantly working on improving cold start times as well as attempting to remove them for functions that are seeing continuous load.
On API Gateway, we are currently in the process of rolling out improved connection reuse between CloudFront and API Gateway, where customers will be able to benefit from connections established via other APIs. This should mean that the percentage of requests that need to do a full TLS handshake between CloudFront and API Gateway should be reduced.

Is significant latency introduced by API Gateway?

I'm trying to figure out where the latency in my calls is coming from. Please let me know if any of this information could be presented more clearly!
Some background: I have two systems--System A and System B. I manually (through Postman) hit an endpoint on System A that invokes an endpoint on System B.
System A is hosted on an EC2 instance.
1. When System B is hosted on a Lambda function behind API Gateway, the latency for the call is 125 ms.
2. When System B is hosted on an EC2 instance, the latency for the call is 8 ms.
3. When System B is hosted on an EC2 instance behind API Gateway, the latency for the call is 100 ms.
So, my hypothesis is that API Gateway is the reason for increased latency when it's paired with the Lambda function as well. Can anyone confirm if this is the case, and if so, what is API Gateway doing that increases the latency so much? Is there any way around it? Thank you!
It might not be exactly what the original question asks for, but I'll add a comment about CloudFront.
In my experience, both CloudFront and API Gateway will add at least 100 ms each for every HTTPS request on average - maybe even more.
This is due to the fact that in order to secure your API call, API Gateway enforces SSL in all of its components. This means that if you are using SSL on your backend, that your first API call will have to negotiate 3 SSL handshakes:
Client to CloudFront
CloudFront to API Gateway
API Gateway to your backend
It is not uncommon for these handshakes to take over 100 milliseconds, meaning that a single request to an inactive API could see over 300 milliseconds of additional overhead. Both CloudFront and API Gateway attempt to reuse connections, so over a large number of requests you’d expect to see that the overhead for each call would approach only the cost of the initial SSL handshake. Unfortunately, if you’re testing from a web browser and making a single call against an API not yet in production, you will likely not see this.
In the same discussion, it was eventually clarified what the "large number of requests" should be to actually see that connection reuse:
Additionally, when I meant large, I should have been slightly more precise in scale. 1000 requests from a single source may not see significant reuse, but APIs that are seeing that many per second from multiple sources would definitely expect to see the results I mentioned.
...
Unfortunately, while I cannot give you an exact number, you will not see any significant connection reuse until you approach closer to 100 requests per second.
Bear in mind that this is a thread from mid-to-late 2016, and some improvements should already be in place. But in my own experience this overhead is still present: a load test on a simple API at 2000 rps was still giving me >200 ms of extra latency as of 2018.
source: https://forums.aws.amazon.com/thread.jspa?messageID=737224
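The amortization described in the quoted thread can be made concrete with a small calculation. This is a simplified model, not how CloudFront or API Gateway actually schedule connections: it assumes every request on a reused connection pays no handshake cost, and the count of 100 requests per connection is an arbitrary illustrative value. `average_overhead_ms` is a hypothetical helper.

```python
def average_overhead_ms(handshake_ms: float,
                        handshakes_per_connection: int,
                        requests_per_connection: int) -> float:
    """Average handshake overhead per request when connections are reused."""
    if requests_per_connection < 1:
        raise ValueError("requests_per_connection must be at least 1")
    total_handshake_cost = handshake_ms * handshakes_per_connection
    return total_handshake_cost / requests_per_connection


# Figures from the answer: three handshakes of about 100 ms each.
single_call = average_overhead_ms(100, 3, requests_per_connection=1)
# Assumed: each established connection chain ends up serving 100 requests.
reused = average_overhead_ms(100, 3, requests_per_connection=100)
print(single_call, reused)
```

This is why a single test call from a browser looks so much worse than the average latency of a busy API.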
Heard from Amazon support on this:
With API Gateway, the request has to go from the client to API Gateway, which means leaving the VPC and going out to the internet, then back to your VPC to go to your other EC2 instance, then back to API Gateway, which means leaving your VPC again, and then back to your first EC2 instance.
So this additional latency is expected. The only way to lower the latency is to add API caching, which is only going to be useful if the content you are requesting is static and not updating constantly. You will still see the longer latency when the item is removed from the cache and needs to be fetched from the system, but it will lower the latency of most calls.
So I guess the latency is normal, which is unfortunate, but hopefully not something we'll have to deal with constantly moving forward.
In the direct case (#2) are you using SSL? 8 ms is very fast for SSL, although if it's within an AZ I suppose it's possible. If you aren't using SSL there, then using APIGW will introduce a secure TLS connection between the client and CloudFront which of course has a latency penalty. But usually that's worth it for a secure connection since the latency is only on the initial establishment.
Once a connection is established all the way through, or when the API has moderate, sustained volume, I'd expect the average latency with APIGW to drop significantly. You'll still see the ~100 ms latency when establishing a new connection though.
Unfortunately the use case you're describing (EC2 -> APIGW -> EC2) isn't great right now. Since APIGW is behind CloudFront, it is optimized for clients all over the world, but you will see additional latency when the client is on EC2.
Edit:
And the reason why you only see a small penalty when adding Lambda is that APIGW already has lots of established connections to Lambda, since it's a single endpoint with a handful of IPs. The actual overhead (not connection related) in APIGW should be similar to Lambda overhead.

AWS Lambda using API Gateway error message

Everything was working yesterday, and I'm still only testing, so my load shouldn't be high to begin with, but I keep receiving these errors today:
{
Message = "We currently do not have sufficient capacity in the region you requested. Our system will be working on provisioning
additional capacity. You can avoid getting this error by temporarily
reducing your request rate.";
Type =Service;
}
What is this error message, and should I be concerned that something like this would happen when I go into production? This is a serious error because my users are required to log in using calls to API Gateway (utilizing AWS Lambda).
This kind of error should not last long, as it immediately triggers a provisioning request on the AWS side.
If you are concerned about the availability of your API Gateway, consider creating redundant Lambda functions in other regions and switching over whenever this error occurs. However, calling Lambda from a remote region can introduce high latency.
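The regional switch-over suggested above could look roughly like the following sketch. Everything here is hypothetical: `CapacityError` stands in for however your client surfaces the "insufficient capacity" service error, and each invoker is a placeholder for a real call to a Lambda in that region.

```python
from typing import Callable, Optional, Sequence, Tuple


class CapacityError(Exception):
    """Hypothetical stand-in for the 'insufficient capacity' service error."""


def invoke_with_fallback(
        invokers: Sequence[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each (region, invoker) in order and return the first success."""
    last_error: Optional[Exception] = None
    for region, invoke in invokers:
        try:
            return region, invoke()
        except CapacityError as err:
            # Only fall through on capacity errors; other errors propagate.
            last_error = err
    raise RuntimeError("all regions reported insufficient capacity") from last_error


def primary() -> str:
    # Placeholder for a call to the primary region that fails.
    raise CapacityError("insufficient capacity")


def secondary() -> str:
    # Placeholder for a call to a redundant function in another region.
    return "ok"


region, result = invoke_with_fallback([("us-east-1", primary),
                                       ("us-west-2", secondary)])
print(region, result)
```

The order of the list decides which region is preferred, so the nearest region should come first to keep the latency penalty limited to failure cases.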
Another suggestion is to review the AWS limits for the API Gateway and Lambda services in your account. If your requests exceed the limits, raise a ticket with AWS to increase them.
Amazon API Gateway Limits
Resource                        Default Limit
Maximum APIs per AWS account    60
Maximum resources per API       300
Maximum labels per API          10
Increasing the limits is a free service in AWS.
Refer: Amazon API Gateway Limits
AWS Lambda posted an event on the service health dashboard, so please follow this for further details on that specific issue.
Unfortunately, if you want to return a custom code when Lambda errors in this way you would have to write a mapping template and attach it to every integration response where you used a Lambda integration.
We recognize that this is suboptimal and is work most customers would prefer API Gateway just handle for them. With that in mind, we already have a high priority item on our backlog to make it easier to pass through the status codes from the Lambda integration. I cannot, however, commit to a timeframe as to when this would be available.