Istio extension to provide session affinity among replicas without using consistent hash

Istio extension to provide session affinity among replicas without using consistent hash - istio

I have microservices that need to be stateful. I also want the microservices to be scalable such that replicas can be added or removed as needed, primarily for auto-scaling. What I'm looking for is either an Istio extension that would support this, or an example Istio extension that I could start with to create such an extension. My thoughts are that when a replica gets a request, if subsequent requests need to come back to that same replica, I'd have the microservice include it's hostname(?) in a response header. The caller would pass that header back in subsequent requests and the extension would notice that header and alter the routing to route the request to the specific replica that needs to receive the request.
I've looked at the consistent hashing mechanism Istio provides, but that won't yield consistent routing as far as I can tell when the number of replicas changes.
Certainly others must have a similar requiremnt.

Related

Log requests/responses in API gateway through Labmbda proxy

I would like to log the complete requests+responses incl. body received on API gateway in an AWS Lambda proxy while passing on the requests for processing to a different server (as in reverse proxy the requests). Because the standard logging from API Gateway to CloudWatch truncates requests/responses after 1024 bytes, I cannot use this option.
So the processing would look like this:
Request -> API Gateway -> Lambda to log full request incl. body -> public API endpoint -> Response -> Lambda to log full response incl. body -> API Gateway -> Response
Is there a known solution for this scenario ?

You probably have a good reason for that, but make sure you aware that logging full request/response bodies may have a lot of undesired implications.
For example, if your service requires GDPR compliance, this is a huge issue. Also, it may greatly affect the performance and make you run into some issues related to quotas. Basically, it is usually not a good idea doing that.
Storing these logs on cloudwatch would be the easiest option for requests were that 1K limit is not an issue. If you have just a bunch of requests where this is not enough, you could consider treating them as an exception.
You could use S3 / DynamoDB / Elastic Search, depending on what you want to do with them, and there are tradeoffs as well.
S3 - This would allow to store very large requests/responses, but it can create a lot of fragmentation. You may end up with a lot of small files and also you need to some sort of index (probably storing the S3 key in cloudwatch logs). Searching can be somewhat painful in this case (although you may be able to use Athena depending on how do you store it).
DynamoDB - Easy to store, but you can run to a lot of quota limits if your API access is too frequent. You may need to bump your costs a lot to prevent them. Also, each record has a limit of 400Kb. I personally don't recommend this approach.
ElasticSearch - Default record size limit is 100Mb but this can be increased. It would make it easy to query this data later on.
I'd say ElasticSearch can be more appropriate for this case given the amount of information this thread has. Also, depending on the volume, some of these solutions would eventually require some publish-subscribe mechanism (ex. Kinesis) in between to handle burst limits and message grouping (if you are S3 for example, you may want to group multiple entries in a single file)

How should microservices developed using AWS API Gateway + Lambda/ECS talk?

I am developing a "micro-services" application using AWS API Gateway with either Lambda or ECS for compute. The issue now is communication between services are via API calls through the API gateway. This feels inefficient and less secure than it can be. Is there a way to make my microservices talk to each other in a more performant and secure manner? Like somehow talk directly within the private network?
One way I thought of is multiple levels of API gateway.
1 public API gateway
1 private API gateway per microservice. And each microservice can call another microservice "directly" inside the private network
But in this way, I need to "duplicate" my routes in 2 levels of API ... this does not seem ideal. I was thinking maybe use {proxy+}. So anything /payment/{proxy+} goes to payment API gateway and so on - theres still 2 levels of API gateway ... but this seem to be the best I can go?
Maybe there is a better way?

There are going to be many ways to build micro-services. I would start by familiarizing yourself with the whitepaper AWS published: Microservices on AWS, Whitepaper - PDF version.
In your question you stated: "The issue now is communication between services are via API calls through the API gateway. This feels inefficient and less secure than it can be. Is there a way to make my microservices talk to each other in a more performant and secure manner?"
Yes - In fact, the AWS Whitepaper, and API Gateway FAQ reference the API Gateway as a "front door" to your application. The intent of API Gateway is to be used for external services communicating to your AWS services.. not AWS services communicating with each other.
There are several ways AWS resources can communicate with each other to call micro-services. A few are outlined in the whitepaper, and this is another resource I have used: Better Together: Amazon ECS and AWS Lambda. The services you use will be based on the requirements you have.
By breaking monolithic applications into small microservices, the communication overhead increases because microservices have to talk to each other. In many implementations, REST over HTTP is used as a communication protocol. It is a light-weight protocol, but high volumes can cause issues. In some cases, it might make sense to think about consolidating services that send a lot of messages back and forth. If you find yourself in a situation where you consolidate more and more of your services just to reduce chattiness, you should review your problem domains and your domain model.
To my understanding, the root of your problem is routing of requests to micro-services. To maintain the "Characteristics of Microservices" you should choose a single solution to manage routing.
API Gateway
You mentioned using API Gateway as a routing solution. API Gateway can be used for routing... however, if you choose to use API Gateway for routing, you should define your routes explicitly in one level. Why?
Using {proxy+} increases attack surface because it requires routing to be properly handled in another micro-service.
One of the advantages of defining routes in API Gateway is that your API is self documenting. If you have multiple API gateways it will become colluded.
The downside of this is that it will take time, and you may have to change existing API's that have already been defined. But, you may already be making changes to existing code base to follow micro-services best practices.
Lambda or other compute resource
Despite the reasons listed above to use API Gateway for routing, if configured properly another resource can properly handle routing. You can have API Gateway proxy to a Lambda function that has all micro-service routes defined or another resource within your VPC with routes defined.
Result
What you do depends on your requirements and time. If you already have an API defined somewhere and simply want API Gateway to be used to throttle, monitor, secure, and log requests, then you will have API Gateway as a proxy. If you want to fully benefit from API Gateway, explicitly define each route within it. Both approaches can follow micro-service best practices, however, it is my opinion that defining each public API in API Gateway is the best way to align with micro-service architecture. The other answers also do a great job explaining the trade-offs with each approach.

I'm going to assume Lambdas for the solution but they could just as well be ECS instances or ELB's.
Current problem
One important concept to understand about lambdas before jumping into the solution is the decoupling of your application code and an event_source.
An event source is a different way to invoke your application code. You mentioned API Gateway, that is only one method of invoking your lambda (an HTTP REQUEST). Other interesting event sources relevant for your solution are:
Api Gateway (As noticed, not effective for inter service communication)
Direct invocation (via AWS Sdk, can be sync or async)
SNS (pub/sub, eventbus)
There are over 20+ different ways of invoking a lambda. documentation
Use case #1 Sync
So, if your HTTP_RESPONSE depends on one lambda calling another and on that 2nd lambdas result. A direct invoke might be a good enough solution to use, this way you can invoke the lambda in a synchronous way. It also means, that lambda should be subscribed to an API Gateway as an event source and have code to normalize the 2 different types of events. (This is why lambda documentation usually has event as one of the parameters)
Use case #2 Async
If your HTTP response doesn't depend on the other micro services (lambdas) execution. I would highly recommend SNS for this use case, as your original lambda publishes a single event and you can have more than 1 lambda subscribed to that event execute in parallel.
More complicated use cases
For more complicated use cases:
Batch processing, fan-out pattern example #1 example #2
Concurrent execution (one lambda calls next, calls next ...etc) AWS Step functions

There are multiple ways and approaches for doing this besides being bound to your current setup and infrastructure without excluding the flexibility to implement/modify the existing code base.
When trying to communicate between services behind the API Gateway is something that needs to be carefully implemented to avoid loops, exposing your data or even worst, blocking your self, see the "generic" image to get a better understanding:
While using HTTP for communicating between the services it is often common to see traffic going out the current infrastructure and then going back through the same API Gateway, something that could be avoided by just going directly the other service in place instead.
In the previous image for example, when service B needs to communicate with service A it is advisable to do it via the internal (ELB) endpoint instead of going out and going back through the API gateway.
Another approach is to use "only" HTTP in the API Gateway and use other protocols to communicate within your services, for example, gRPC. (not the best alternative in some cases since depends on your architecture and flexibility to modify/adapt existing code)
There are cases in where your infrastructure is more complex and you may not communicate on demand within your containers or the endpoints are just unreachable, in this cases, you could try to implement an event-driven architecture (SQS and AWS Lambda)
I like going asynchronous by using events/queues when possible, from my perspective "scales" better and must of the services become just consumers/workers besides no need to listen for incoming request (no HTTP needed), here is an article, explaining how to use rabbitmq for this purpose communicating microservices within docker
These are just some ideas that hope could help you to find your own "best" way since is something that varies too much and every scenario is unique.

I don't think your question is strictly related to AWS but more like a general way of communication between the services.
API Gateway is used as an edge service which is a service at your backend boundary and accessible by external parties. For communication behind the API Gateway, between your microservices, you don't necessary have to go through the API Gateway again.
There are 2 ways of communication which I'd mention for your case:
HTTP
Messaging
HTTP is the most simplistic way of communication as it's naturally easier to understand and there are tons of libraries which makes it easy to use.
Despite the fact of the advantages, there are a couple of things to look out for.
Failure handling
Circuit breaking in case a service is unavailable to respond
Consistency
Retries
Using service discovery (e.g. Eureka) to make the system more flexible when calling another service
On the messaging side, you have to deal with asynchronous processing, infrastructure problems like setting up the message broker and maintaining it, it's not as easy to use as pure HTTP, but you can solve consistency problems with just being eventually consistent.
Overall, there are tons of things which you have to consider and everything is about trade-offs. If you are just starting with microservices, I think it's best to start with using HTTP for communication and then slowly going to the messaging alternative.
For example in the Java + Spring Cloud Netflix world, you can have Eureka with Feign and with that it's really easy to use logical address to the services which is translated by Eureka to actual IP and ports. Also, if you wanna use Swagger for your REST APIs, you can even generate Feign client stubs from it.

I've had the same question on my mind for a while now and still cannot find a good generic solutions... For what it's worth...
If the communication is one way and the "caller" does not need to wait for a result, I find Kinesis streams very powerful - just post a "task" onto the stream and have the stream trigger a lambda to process it. But obviously, this works in very limited cases...
For the response-reply world, I call the API Gateway endpoints just like an end user would (with the added overhead of marshaling and unmarshaling data to "fit" in the HTTP world, and unnecessary multiple authentications).
In rare cases, I may have a single backend lambda function which gets invoked by both the Gateway API lambda and other microservices directly. This adds an extra "hop" for "end users" (instead of [UI -> Gateway API -> GatewayAPI lambda], now I have [UI -> Gateway API -> GatewayAPI lambda -> Backend lambda]), but makes microservice originated calls faster (since the call and all associated data no longer need to be "tunneled" through an HTTP request). Plus, this makes the architecture more complicated (I no longer have a single official API, but now have a "back channel" direct dependencies).

Routing traffic to specific AWS regions using wildcard subdomain

I'm building a Laravel application that offers an authoring tool to customers. Each customer will get their own subdomain i.e:
customer-a.my-tool.com
customer-b.my-tool.com
My tool is hosted on Amazon in multiple regions for performance but mostly for privacy law reasons(GDPR++). Each customer have their data in only one region. Australian customers in Australia, European in Europe etc. So the customers users must be directed to the correct region. If a European user ends up being served by the US region their data won't be there.
We can solve this manually using DNS and simply point each sub-domain to the correct IP, but we don't want to do this for two reasons. (1) updating the DNS might take up to 60 seconds. We don't want the customer to wait. (2) It seems the sites we've researched uses wildcard domains. For instance slack and atlassian.net. We know that atlassian.net also have multiple regions.
So the question is:
How can we use a wildcard domain and still route the traffic to the regions where the content is located?
Note:
We don't want the content in all regions, but we can have for instance a DynamoDB available in all regions mapping subdomains to regions.
We don't want to tie an organization to a region. I.e. a domain structure like customer-a.region.my-tool.com is an option we've considered, but discarded
We, of course, don't want to be paying for transferring the data twice, and having apps in all regions accessing the databases in the regions the data belong to is not an option since it will be slow.

How can we use a wildcard domain and still route the traffic to the regions where the content is located?
It is, in essence, not possible to do everything you are trying to do, given all of the constraints you are imposing: automatically, instantaneously, consistently, and with zero overhead, zero cost, and zero complexity.
But that isn't to say it's entirely impossible.
You have asserted that other vendors are using a "wildcard domain," which is a concept that is essentially different than I suspect you believe it necessarily entails. A wildcard in DNS, like *.example.com is not something you can prove to the exclusion of other possibilities, because wildcard records are overridden by more specific records.
For a tangible example that you can observe, yourself... *.s3.amazonaws.com has a DNS wildcard. If you query some-random-non-existent-bucket.s3.amazonaws.com, you will find that it's a valid DNS record, and it routes to S3 in us-east-1. If you then create a bucket by that name in another region, and query the DNS a few minutes later, you'll find that it has begun returning a record that points to the S3 endpoint in the region where you created the bucket. Yes, it was and is a wildcard record, but now there's a more specific record that overrides the wildcard. The override will persist for at least as long as the bucket exists.
Architecturally, other vendors that segregate their data by regions (rather than replicating it, which is another possibility, but not applicable to your scenario) must necessarily be doing something along one of these lines:
creating specific DNS records and accepting the delay until the DNS is ready or
implementing what I'll call a "hybrid" environment that behaves one way initially, and a different way eventually, this evironment uses specific DNS records to override a wildcard and has an ability to temporarily deliver, via a reverse proxy, a misrouted request to the correct cluster, to allow instantaneous correct behavior until the DNS propagates or
an ongoing "two-tier" environment, using a wildcard without more specific records to override it, operating a two-tier infrastructure, with an outer tier that is distributed globally, that accepts any request, and has internal routing records that deliver the request to an inner tier -- the correct regional cluster.
The first option really doesn't seem unreasonable. Waiting a short time for your own subdomain to be created seems reasonably common. But, there are other options.
The second option, the hybrid environment, would simply require that the location where your wildcard points to by default be able to do some kind of database lookup to determine where the request should go, and proxy the request there. Yes, you would pay for inter-region transport, if you implement this yourself in EC2, but only until the DNS update takes effect. Inter-region bandwidth between any two AWS regions costs substantially less than data transfer to the Internet -- far less than "double" the cost.
This might be accomplished in any number of ways that are relatively straightforward.
You must, almost by definition, have a master database of the site configuration, somewhere, and this system could be queried by a complicated service that provides the proxying -- HAProxy and Nginx both support proxying and both support Lua integrations that could be used to do a lookup of routing information, which could be cached and used as long as needed to handle the temporarily "misrouted" requests. (HAProxy also has static-but-updatable map tables and dynamic "stick" tables that can be manipulated at runtime by specially-crafted requests; Nginx may offer similar things.)
But EC2 isn't the only way to handle this.
Lambda#Edge allows a CloudFront distribution to select a back-end based on logic -- such as a query to a DynamoDB table or a call to another Lambda function that can query a relational database. Your "wildcard" CloudFront distribution could implement such a lookup, caching results in memory (container reuse allows very simple in-memory caching using simply an object in a global varible). Once the DNS record propagates, the requests would go directly from the browser to the appropriate back-end. CloudFront is marketed as a CDN, but it is in fact a globally-distributed reverse proxy with an optional response caching capability. This capability may not be obvious at first.
In fact, CloudFront and Lambda#Edge could be used for such a scenario as yours in either the "hybrid" environment or the "two-tier" environment. The outer tier is CloudFront -- which automatically routes requests to the edge on the AWS network that is nearest the viewer, at which point a routing decision can be made at the edge to determine the correct cluster of your inner tier to handle the request. You don't pay for anything twice, here, since bandwidth from EC2 to CloudFront costs nothing. This will not impact site performance other than the time necessary for thst initial database lookup, and once your active containers have that cached the responsiveness of the site will not be impaired. CloudFront, in general, improves responsiveness of sites even when most of the content is dynamic, because it optimizes both the network path and protocol exchanges between the viewer and your back-end, with optimized TCP stacks and connection reuse (particularly helpful at reducing the multiple round-trips required by TLS handshakes).
In fact, CloudFront seems to offer an opportunity to have it both ways -- an initially hybrid capability that automatically morphs into a two-tier infrastructure -- because CloudFront distributions also have a wildcard functionality with overrides: a distribution with *.example.com handles all requests unless a distribution with a more specific domain name is provisioned -- at which point the other distribution will start handling the traffic. CloudFront takes a few minutes before the new distribution overrides the wildcard, but when the switchover happens, it's clean. A few minutes after the new distribution is configured, you make a parallel DNS change to the newly assigned hostname for the new distribution, but CloudFront is designed in such a way that you do not have to tightly coordinate this change -- all endpoints will handle all domains because CloudFront doesn't use the endpoint to make the routing decision, it uses SNI and the HTTP Host header.
This seems almost like a no-brainer. A default, wildcard CloudFront distribution is pointed to by a default, wildcard DNS record, and uses Lambda#Edge to identify which of your clusters handles a given subdomain using a database lookup, followed by the deployment -- automated, of course -- of a distribution for each of your customers, which already knows how to forward the request to the correct cluster, so no further database queries are needed after the subdomain is fully live. You'll need to ask AWS Support to increase your account's limit for the number of CloudFront distributions from the default of 200, but that should not be a problem.
There are multiple ways to accomplish that database lookup. As mentioned, before, the Lambda#Edge function can invoke a second Lambda function inside VPC to query the database for routing instructions, or you could push the domain location config to a DynamoDB global table, which would replicate your domain routing instructions to multiple DynamoDB regions (currently Virginia, Ohio, Oregon, Ireland, and Frankfurt) and DynamoDB can be queried directly from a Lambda#Edge function.

How To Prevent AWS Lambda Abuse by 3rd-party apps

Very interested in getting hands-on with Serverless in 2018. Already looking to implement usage of AWS Lambda in several decentralized app projects. However, I don't yet understand how you can prevent abuse of your endpoint from a 3rd-party app (perhaps even a competitor), from driving up your usage costs.
I'm not talking about a DDoS, or where all the traffic is coming from a single IP, which can happen on any network, but specifically having a 3rd-party app's customers directly make the REST calls, which cause your usage costs to rise, because their app is piggy-backing on your "open" endpoints.
For example:
I wish to create an endpoint on AWS Lambda to give me the current price of Ethereum ETH/USD. What would prevent another (or every) dapp developer from using MY lambda endpoint and causing excessive billing charges to my account?

When you deploy an endpoint that is open to the world, you're opening it to be used, but also to be abused.
AWS provides services to avoid common abuse methods, such as AWS Shield, which mitigates against DDoS, etc., however, they do not know what is or is not abuse of your Lambda function, as you are asking.
If your Lambda function is private, then you should use one of the API gateway security mechanisms to prevent abuse:
IAM security
API key security
Custom security authorization
With one of these in place, your Lambda function can only by called by authorized users. Without one of these in place, there is no way to prevent the type of abuse you're concerned about.

Unlimited access to your public Lambda functions - either by bad actors, or by bad software developed by legitimate 3rd parties, can result in unwanted usage of billable corporate resources, and can degrade application performance. It is important to you consider ways of limiting and restricting access to your Lambda clients as part of your systems security design, to prevent runaway function invocations and uncontrolled costs.
Consider using the following approach to preventing execution "abuse" of your Lambda endpoint by 3rd party apps:
One factor you want to control is concurrency, or number of concurrent requests that are supported per account and per function. You are billed per request plus total memory allocation per request, so this is the unit you want to control. To prevent run away costs, you prevent run away executions - either by bad actors, or by bad software cause by legitimate 3rd parties.
From Managing Concurrency
The unit of scale for AWS Lambda is a concurrent execution (see
Understanding Scaling Behavior for more details). However, scaling
indefinitely is not desirable in all scenarios. For example, you may
want to control your concurrency for cost reasons, or to regulate how
long it takes you to process a batch of events, or to simply match it
with a downstream resource. To assist with this, Lambda provides a
concurrent execution limit control at both the account level and the
function level.
In addition to per account and per Lambda invocation limits, you can also control Lambda exposure by wrapping Lambda calls in an AWS API Gateway, and Create and Use API Gateway Usage Plans:
After you create, test, and deploy your APIs, you can use API Gateway
usage plans to extend them as product offerings for your customers.
You can provide usage plans to allow specified customers to access
selected APIs at agreed-upon request rates and quotas that can meet
their business requirements and budget constraints.
What Is a Usage Plan? A usage plan prescribes who can access one or
more deployed API stages— and also how much and how fast the caller
can access the APIs. The plan uses an API key to identify an API
client and meters access to an API stage with the configurable
throttling and quota limits that are enforced on individual client API
keys.
The throttling prescribes the request rate limits that are applied to
each API key. The quotas are the maximum number of requests with a
given API key submitted within a specified time interval. You can
configure individual API methods to require API key authorization
based on usage plan configuration. An API stage is identified by an
API identifier and a stage name.
Using API Gateway Limits to create Gateway Usage Plans per customer, you can control API and Lambda access prevent uncontrolled account billing.

#Matt answer is correct, yet incomplete.
Adding a security layer is a necessary step towards security, but doesn't protect you from authenticated callers, as #Rodrigo's answer states.
I actually just encountered - and solved - this issue on one of my lambda, thanks to this article: https://itnext.io/the-everything-guide-to-lambda-throttling-reserved-concurrency-and-execution-limits-d64f144129e5
Basically, I added a single line on my serverless.yml file, in my function that gets called by the said authirized 3rd party:
reservedConcurrency: 1
And here goes the whole function:
refresh-cache:
handler: src/functions/refresh-cache.refreshCache
# XXX Ensures the lambda always has one slot available, and never use more than one lambda instance at once.
# Avoids GraphCMS webhooks to abuse our lambda (GCMS will trigger the webhook once per create/update/delete operation)
# This makes sure only one instance of that lambda can run at once, to avoid refreshing the cache with parallel runs
# Avoid spawning tons of API calls (most of them would timeout anyway, around 80%)
# See https://itnext.io/the-everything-guide-to-lambda-throttling-reserved-concurrency-and-execution-limits-d64f144129e5
reservedConcurrency: 1
events:
- http:
method: POST
path: /refresh-cache
cors: true
The refresh-cache lambda was invoked by a webhook triggered by a third party service when any data change. When importing a dataset, it would for instance trigger as much as 100 calls to refresh-cache. This behaviour was completely spamming my API, which in turn was running requests to other services in order to perform a cache invalidation.
Adding this single line improved the situation a lot, because only one instance of the lambda was running at once (no concurrent run), the number of calls was divided by ~10, instead of 50 calls to refresh-cache, it only triggered 3-4, and all those call worked (200 instead of 500 due to timeout issue).
Overall, pretty good. Not yet perfect for my workflow, but a step forward.
Not related, but I used https://epsagon.com/ which tremendously helped me figuring out what was happening on AWS Lambda. Here is what I got:
Before applying reservedConcurrency limit to the lambda:
You can see that most calls fail with timeout (30000ms), only the few first succeed because the lambda isn't overloaded yet.
After applying reservedConcurrency limit to the lambda:
You can see that all calls succeed, and they are much faster. No timeout.
Saves both money, and time.
Using reservedConcurrency is not the only way to deal with this issue, there are many other, as #Rodrigo stated in his answer. But it's a working one, that may fit in your workflow. It's applied on the Lambda level, not on API Gateway (if I understand the docs correctly).

Is significant latency introduced by API Gateway?

I'm trying to figure out where the latency in my calls is coming from, please let me know if any of this information could be presented in a format that is more clear!
Some background: I have two systems--System A and System B. I manually (through Postman) hit an endpoint on System A that invokes an endpoint on System B.
System A is hosted on an EC2 instance.
When System B is hosted on a Lambda function behind API Gateway, the
latency for the call is 125 ms.
When System B is hosted on an
EC2 instance, the latency for the call is 8 ms.
When System B is
hosted on an EC2 instance behind API Gateway, the latency for the
call is 100 ms.
So, my hypothesis is that API Gateway is the reason for increased latency when it's paired with the Lambda function as well. Can anyone confirm if this is the case, and if so, what is API Gateway doing that increases the latency so much? Is there any way around it? Thank you!

It might not be exactly what the original question asks for, but I'll add a comment about CloudFront.
In my experience, both CloudFront and API Gateway will add at least 100 ms each for every HTTPS request on average - maybe even more.
This is due to the fact that in order to secure your API call, API Gateway enforces SSL in all of its components. This means that if you are using SSL on your backend, that your first API call will have to negotiate 3 SSL handshakes:
Client to CloudFront
CloudFront to API Gateway
API Gateway to your backend
It is not uncommon for these handshakes to take over 100 milliseconds, meaning that a single request to an inactive API could see over 300 milliseconds of additional overhead. Both CloudFront and API Gateway attempt to reuse connections, so over a large number of requests you’d expect to see that the overhead for each call would approach only the cost of the initial SSL handshake. Unfortunately, if you’re testing from a web browser and making a single call against an API not yet in production, you will likely not see this.
In the same discussion, it was eventually clarified what the "large number of requests" should be to actually see that connection reuse:
Additionally, when I meant large, I should have been slightly more precise in scale. 1000 requests from a single source may not see significant reuse, but APIs that are seeing that many per second from multiple sources would definitely expect to see the results I mentioned.
...
Unfortunately, while cannot give you an exact number, you will not see any significant connection reuse until you approach closer to 100 requests per second.
Bear in mind that this is a thread from mid-late 2016, and there should be some improvements already in place. But in my own experience, this overhead is still present and performing a loadtest on a simple API with 2000 rps is still giving me >200 ms extra latency as of 2018.
source: https://forums.aws.amazon.com/thread.jspa?messageID=737224

Heard from Amazon support on this:
With API Gateway it requires going from the client to API Gateway,
which means leaving the VPC and going out to the internet, then back
to your VPC to go to your other EC2 Instance, then back to API
Gateway, which means leaving your VPC again and then back to your
first EC2 instance.
So this additional latency is expected. The only way to lower the
latency is to add in API Caching which is only going to be useful is
if the content you are requesting is going to be static and not
updating constantly. You will still see the longer latency when the
item is removed from cache and needs to be fetched from the System,
but it will lower most calls.
So I guess the latency is normal, which is unfortunate, but hopefully not something we'll have to deal with constantly moving forward.

In the direct case (#2) are you using SSL? 8 ms is very fast for SSL, although if it's within an AZ I suppose it's possible. If you aren't using SSL there, then using APIGW will introduce a secure TLS connection between the client and CloudFront which of course has a latency penalty. But usually that's worth it for a secure connection since the latency is only on the initial establishment.
Once a connection is established all the way through, or when the API has moderate, sustained volume, I'd expect the average latency with APIGW to drop significantly. You'll still see the ~100 ms latency when establishing a new connection though.
Unfortunately the use case you're describing (EC2 -> APIGW -> EC2) isn't great right now. Since APIGW is behind CloudFront, it is optimized for clients all over the world, but you will see additional latency when the client is on EC2.
Edit:
And the reason why you only see a small penalty when adding Lambda is that APIGW already has lots of established connections to Lambda, since it's a single endpoint with a handful of IPs. The actual overhead (not connection related) in APIGW should be similar to Lambda overhead.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js