AWS ECS: VPC Endpoints and NAT Gateways

According to the AWS documentation on NAT Gateways, they cannot send traffic over VPC endpoints unless the routing is set up in the following manner:
A NAT gateway cannot send traffic over VPC endpoints [...]. If your instances in the private subnet must access resources over a VPC endpoint [...], use the private subnet’s route table to route the traffic directly to these devices.
Following this example in the docs, I created the following configuration for my ECS app:
VPC (vpc-app) with CIDR 172.31.0.0/16.
App subnet (subnet-app) with the following route table:
Destination | Target
----------------|-----------
172.31.0.0/16 | local
0.0.0.0/0 | nat-main
NAT Gateway (nat-main) in vpc-app, in subnet default-1, which has the following route table:
Destination | Target
----------------|--------------
172.31.0.0/16 | local
0.0.0.0/0 | igw-xxxxxxxx
Security Group (sg-app) with port 443 open for subnet-app.
VPC Endpoints (Interface type) created with vpc-app, subnet-app and sg-app for the following services (a creation sketch follows the list):
com.amazonaws.eu-west-1.ecr.api
com.amazonaws.eu-west-1.ecr.dkr
com.amazonaws.eu-west-1.ecs
com.amazonaws.eu-west-1.ecs-agent
com.amazonaws.eu-west-1.ecs-telemetry
com.amazonaws.eu-west-1.s3 (Gateway)
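For reference, a minimal boto3 sketch of how one of these interface endpoints could be created (the resource IDs are placeholders standing in for vpc-app, subnet-app and sg-app):
import boto3
ec2 = boto3.client('ec2', region_name='eu-west-1')
response = ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId='vpc-0123456789abcdef0',              # vpc-app (placeholder ID)
    ServiceName='com.amazonaws.eu-west-1.ecr.dkr',
    SubnetIds=['subnet-0123456789abcdef0'],     # subnet-app (placeholder ID)
    SecurityGroupIds=['sg-0123456789abcdef0'],  # sg-app, must allow 443 from subnet-app
    PrivateDnsEnabled=True,                     # resolve the service's default DNS name to the ENI IPs
)
print(response['VpcEndpoint']['VpcEndpointId'])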
It's also important to mention that I've enabled DNS Resolution and DNS Hostnames for vpc-app, as well as the Enable Private DNS Name option for the ecr-dkr and ecr-api VPC endpoints.
I've also tried working only with Fargate containers since they don't have the added complication of the ECS Agent, and because according to the docs:
Tasks using the Fargate launch type only require the com.amazonaws.region.ecr.dkr Amazon ECR VPC endpoint and the Amazon S3 gateway endpoint to take advantage of this feature.
This also doesn't work: every time my Fargate tasks run, I see a spike in Bytes out to source under nat-main's Monitoring tab.
No matter what I try, the EC2 instances (and Fargate tasks) in subnet-app are still pulling images through nat-main instead of going to the local address of the ECR service.
I've restarted the ECS Agent and made sure to check all the boxes in the ECS Interface VPC Endpoints guide AND the ECR Interface Endpoints guide.
What am I missing here?
Any help would be appreciated.

After many hours of trial and error, and with lots of help from #jogold, the missing piece was found in this blog post:
The next step is to create a gateway VPC endpoint for S3. This is necessary because ECR uses S3 to store Docker image layers. When your instances download Docker images from ECR, they must access ECR to get the image manifest and S3 to download the actual image layers.
After I created the S3 Gateway VPCE, I forgot to associate it with subnet-app's route table, so although the initial request to my ECR URI was made using the internal address, the download of the image layers from S3 still used the NAT Gateway.
After adding the entry, the network usage of the NAT Gateway dropped dramatically.
More information on how to set up a Gateway VPCE can be found here.
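For completeness, a rough boto3 sketch of that missing piece (placeholder IDs; rtb-... stands in for subnet-app's route table), assuming the Gateway endpoint approach described above:
import boto3
ec2 = boto3.client('ec2', region_name='eu-west-1')
ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId='vpc-0123456789abcdef0',            # vpc-app (placeholder ID)
    ServiceName='com.amazonaws.eu-west-1.s3',
    RouteTableIds=['rtb-0123456789abcdef0'],  # subnet-app's route table; this adds the S3 prefix-list route
)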

Interface VPC endpoints work with DNS resolution, not routing.
In order for your configuration to work, you need to ensure that you checked Enable Private DNS Name when you created the endpoint. This enables you to make requests to the service using its default DNS hostname instead of the endpoint-specific DNS hostnames.
From the documentation:
When you create an interface endpoint, we generate endpoint-specific DNS hostnames that you can use to communicate with the service. For AWS services and AWS Marketplace partner services, you can optionally enable private DNS for the endpoint. This option associates a private hosted zone with your VPC. The hosted zone contains a record set for the default DNS name for the service (for example, ec2.us-east-1.amazonaws.com) that resolves to the private IP addresses of the endpoint network interfaces in your VPC. This enables you to make requests to the service using its default DNS hostname instead of the endpoint-specific DNS hostnames. For example, if your existing applications make requests to an AWS service, they can continue to make requests through the interface endpoint without requiring any configuration changes.
The alternative is to update your application to use your endpoint-specific DNS hostnames.
Note that to use private DNS names, DNS resolution and DNS hostnames must be enabled for your VPC.
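If you manage this programmatically, a minimal boto3 sketch (placeholder VPC ID) of enabling both attributes looks like this; note that modify_vpc_attribute accepts only one attribute per call:
import boto3
ec2 = boto3.client('ec2')
vpc_id = 'vpc-0123456789abcdef0'  # placeholder
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={'Value': True})
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={'Value': True})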
Also note that in order to use ECR/ECS without a NAT Gateway, you need to configure an S3 endpoint (Gateway type, which requires a route table update) to allow instances to download the image layers from the underlying private Amazon S3 buckets that host them. More information in Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR.

Related

Django App in ECS Container Cannot Connect to S3 in Gov Cloud

I have a container running in an EC2 instance on ECS. The container is hosting a Django based application that utilizes S3 and RDS for its file storage and db needs respectively. I have appropriately configured my VPC, subnets, VPC endpoints, Internet Gateway, roles, security groups, and other parameters such that I am able to host the site, connect to the RDS instance, and access the site.
The issue is with the connection to S3. When I try to run the command python manage.py collectstatic --no-input, which should upload/update any new/modified files to S3 as part of the application setup, the program hangs and will not continue. No files are transferred to the already set up S3 bucket.
Details of the set up:
All of the below is hosted on AWS Gov Cloud
VPC and Subnets
1 VPC located in Gov Cloud East with 2 availability zones (AZ) and one private and public subnet in each AZ (4 total subnets)
The 3 default routing tables (1 for each private subnet, and 1 for the two public subnets together)
DNS hostnames and DNS resolution are both enabled
VPC Endpoints
All endpoints have the "vpce-sg" security group attached and are associated to the above vpc
s3 gateway endpoint (set up to use the two private subnet routing tables)
ecr-api interface endpoint
ecr-dkr interface endpoint
ecs-agent interface endpoint
ecs interface endpoint
ecs-telemetry interface endpoint
logs interface endpoint
rds interface endpoint
Security Groups
Elastic Load Balancer Security Group (elb-sg)
Used for the elastic load balancer
Only allows inbound traffic from my local IP
No outbound restrictions
ECS Security Group (ecs-sg)
Used for the EC2 instance in ECS
Allows all traffic from the elb-sg
Allows http:80, https:443 from vpce-sg for s3
Allows postgresql:5432 from vpce-sg for rds
No outbound restrictions
VPC Endpoints Security Group (vpce-sg)
Used for all vpc endpoints
Allows http:80, https:443 from ecs-sg for s3
Allows postgresql:5432 from ecs-sg for rds
No outbound restrictions
Elastic Load Balancer
Set up to use an Amazon Certificate https connection with a domain managed by GoDaddy since Gov Cloud route53 does not allow public hosted zones
Listener on http permanently redirects to https
Roles
ecsInstanceRole (Used for the EC2 instance on ECS)
Attached policies: AmazonS3FullAccess, AmazonEC2ContainerServiceforEC2Role, AmazonRDSFullAccess
Trust relationships: ec2.amazonaws.com
ecsTaskExecutionRole (Used for executionRole in task definition)
Attached policies: AmazonECSTaskExecutionRolePolicy
Trust relationships: ec2.amazonaws.com, ecs-tasks.amazonaws.com
ecsRunTaskRole (Used for taskRole in task definition)
Attached policies: AmazonS3FullAccess, CloudWatchLogsFullAccess, AmazonRDSFullAccess
Trust relationships: ec2.amazonaws.com, ecs-tasks.amazonaws.com
S3 Bucket
Standard bucket set up in the same Gov Cloud region as everything else
Trouble Shooting
If I bypass the connection to S3, the application successfully launches and I can connect to the website, but since static files are supposed to be hosted on S3, there is less formatting and images are missing.
Using a bastion instance I was able to ssh into the EC2 instance running the container and successfully test my connection to s3 from there using aws s3 ls s3://BUCKET_NAME
If I connect to a shell within the application container itself and try to connect to the bucket using...
import boto3
# BUCKET_NAME holds the bucket name (redacted here)
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
s3.meta.client.head_bucket(Bucket=bucket.name)
I receive a timeout error...
File "/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 179, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPSConnection object at 0x7f3da4467190>, 'Connection to BUCKET_NAME.s3.amazonaws.com timed out. (connect timeout=60)')
...
File "/.venv/lib/python3.9/site-packages/botocore/httpsession.py", line 418, in send
raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://BUCKET_NAME.s3.amazonaws.com/"
Based on this article, I think this may have something to do with the fact that I am using the GoDaddy DNS servers, which may be preventing proper URL resolution for S3.
If you're using the Amazon DNS servers, you must enable both DNS hostnames and DNS resolution for your VPC. If you're using your own DNS server, ensure that requests to Amazon S3 resolve correctly to the IP addresses maintained by AWS.
I am unsure of how to ensure that requests to Amazon S3 resolve correctly to the IP addresses maintained by AWS. Perhaps I need to set up another private DNS on Route 53?
I have tried a very similar set up for this application in AWS non-Gov Cloud using route53 public DNS instead of GoDaddy and there is no issue connecting to S3.
Please let me know if there is any other information I can provide to help.
AWS Region
The issue lies in how boto3 handles different AWS regions. This may be unique to usage on AWS GovCloud. Originally I did not have a region configured for S3, but according to the docs an optional setting named AWS_S3_REGION_NAME can be set.
AWS_S3_REGION_NAME (optional: default is None)
Name of the AWS S3 region to use (eg. eu-west-1)
I reached this conclusion thanks to a Stack Overflow answer I was using to try to manually connect to S3 via boto3. I noticed that it included an argument for region_name when creating the session, which alerted me to make sure I had appropriately set the region in my app.settings and environment variables.
If anyone has some background on why this needs to be set for GovCloud functionality but apparently not for commercial, I would be interested to know.
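Assuming the django-storages S3 backend, the fix amounts to something like this in settings.py (the region value is illustrative; use the region your bucket lives in):
# settings.py
AWS_S3_REGION_NAME = 'us-gov-east-1'  # illustrative GovCloud East region name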
Signature Version
I also had to specify the AWS_S3_SIGNATURE_VERSION in app.settings so boto3 knew to use version 4 of the signature. According to the docs:
As of boto3 version 1.13.21 the default signature version used for generating presigned urls is still v2. To be able to access your s3 objects in all regions through presigned urls, explicitly set this to s3v4. Set this to use an alternate version such as s3. Note that only certain regions support the legacy s3 (also known as v2) version.
Some additional information in this Stack Overflow response details that new S3 regions deployed after January 2014 only support signature version 4 (see the AWS docs notice).
Apparently GovCloud is in this group of newly deployed regions.
If you do not specify this, calls to the S3 bucket for static files, such as JS scripts, during operation of the web application will receive a 400 response. S3 responds with the error message:
<Error>
<Code>InvalidRequest</Code>
<Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message>
<RequestId>#########</RequestId>
<HostId>##########</HostId>
</Error>
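Again assuming django-storages, the corresponding fix is a one-liner in settings.py:
# settings.py
AWS_S3_SIGNATURE_VERSION = 's3v4'  # force Signature Version 4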

AWS VPC endpoints - how do they work?

I am trying to understand the concept of how VPC endpoints work and I am not sure that I understand the AWS documentation. For example, I have a private S3 bucket and I have an EKS cluster. So if my bucket is private, I believe that traffic from the EKS cluster to S3 does not go through the internet, but only through the AWS network. But if my S3 bucket were public, then I would probably need to set up the VPC endpoint so traffic does not leave AWS. I would expect the same logic with ECR: if it is private, you load images to your EKS over the AWS network.
So what is the exact case when you need to use VPC endpoint within your AWS account (not from on-prem or another VPC)?
VPC endpoints are typically used with public AWS services (such as S3, DynamoDB, ECR, etc.) when the client applications are hosted inside your VPC and you do not want to route traffic via public Internet, which would otherwise result in a number of hops to reach the AWS service.
Imagine a situation when you have an app running on an EC2 instance, which is deployed to a private subnet of your VPC (i.e. a Pod in your EKS cluster). This app reads/writes data from/to AWS S3. If you do not use a VPC endpoint, your traffic will first reach your NAT gateway, then your VPC's Internet gateway out to the public Internet. Eventually, it will hit AWS S3. The response will travel back via the same route.
Same thing with ECR (i.e. a new instance of your Kubernetes Pod started by the kubelet). It's better (i.e. quicker) to pick the shortest route to download a Docker image from ECR rather than traverse a number of switches/routers. With a VPC endpoint your traffic will first hit the VPC endpoint (without leaving your private subnet) and then reach e.g. ECR directly (traffic does not leave the Amazon network).
As correctly mentioned by #jarmod, one should differentiate between routing (Layer 3 in the OSI model) and authentication/authorization (Layer 7). For example, you can use a VPC endpoint to reach AWS S3, but still not be authorized (or even authenticated) to read a file from an S3 bucket.
Hope this clarifies the idea behind using VPC endpoints.

How does the AWS Interface VPC endpoint actually route traffic to the regional service?

When I configure an AWS Gateway VPC endpoint, a route table entry is created that points to the Gateway. Here, Gateway can be thought of performing the routing to AWS service (over private network).
However, for an AWS Interface VPC endpoint, all that is visible is a network interface that has a private IP address in the subnet. By default, a private IP can send traffic within the subnet or the entire VPC, provided the Security Group and NACL allow the traffic. And it appears that in this case there is no route table entry pointing to a gateway or a router for allowing traffic outside the VPC.
How / Where is the interface routing the traffic to i.e. How does traffic leave the customer VPC?
Of course I understand that the traffic finally reaches the intended AWS service over the private network, but here I am trying to find out where the gateway or router is. Does AWS hide this implementation?
I cannot get my head around the fact that a simple network interface can accept traffic and route it to a service all by itself, i.e. perform the routing by itself. Clearly, in this case the traffic does not appear to flow through the VPC router or another gateway device.
I am aware this might be an AWS confidential implementation but any thoughts / idea on how they might have designed this feature?
It doesn't provide routing at all. By default, when a VPC interface endpoint is created, it creates an ENI for you in each subnet you select. It also provides you with a DNS name per AZ and a regional name that you can use within your applications.
In addition, it supports having the AWS service's domain name resolve to the private IPs of the endpoint. As long as your VPC has DNS enabled, name lookups first hit the VPC's private DNS resolver and resolve to the private IP rather than the public one.
This is done by adding an additional private hosted zone to your VPC which resolves service domains in your region such as ec2.us-east-1.amazonaws.com.
From the AWS side, this is just an ENI created in your VPC that is connected to one of AWS's internal VPCs. It's actually possible to implement this for your own services too, to share them with another organisation's VPCs; this is implemented using AWS PrivateLink.
For more information take a look at the Private DNS for interface endpoints page.
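One way to observe this from inside the VPC (a sketch; the hostname is just an example service name) is to resolve the service's default DNS name and check that it returns the endpoint ENIs' private IP addresses rather than public ones:
import socket
# With private DNS enabled on the endpoint, this should print private (RFC 1918) addresses.
addresses = {info[4][0] for info in socket.getaddrinfo('ec2.us-east-1.amazonaws.com', 443)}
print(addresses)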

Route table for docker hub and vpc endpoints for private hosted instances: AWS

I have a Docker image which is just a Java application. The Java application reads data from DynamoDB and S3 buckets and outputs something (it's a test app). I have hosted the Docker image on a public Docker Hub repo.
In AWS, I have created a private subnet which hosts an EC2 instance via AWS ECS. To keep security high, I am using VPC endpoints for the containers' DynamoDB and S3 bucket operations.
And I have used a NAT Gateway to allow the EC2 instance to pull Docker images from Docker Hub.
Problem:
When I remove the VPC endpoints, the application is still able to read DynamoDB and S3 via the NAT Gateway, which means the traffic is going through the public network.
Thoughts:
Cannot whitelist the IP addresses of Docker Hub as they can change.
Since AWS ECS handles all the docker pull tasks, I do not have control to customize this.
I do not want to use the AWS container registry (ECR); I prefer Docker Hub.
The DynamoDB/S3 private addresses are not known.
Question:
How to make sure that traffic for Docker Hub is only allowed via the NAT Gateway?
How to make sure that DynamoDB and S3 access goes via the endpoints only?
Thanks for your help
If you want to restrict outbound traffic over your NAT (by DNS hostname) to Docker Hub only, you will need a third-party solution that can allow or deny outbound traffic before it traverses the internet.
You would install this appliance in a separate subnet which has NAT Gateway access. Then, in your existing subnet(s) for ECS, you would update the route table to point the 0.0.0.0/0 route at this appliance (by specifying its ENI). If you check the AWS Marketplace, there may be a solution already in place to fulfil the domain filtering.
Alternatively, you could automate a tool that is able to scrape the whitelisted IP addresses for Docker Hub and then have it add these as allow-all-traffic rules in a NACL. This NACL would only be applied to the subnets that the NAT Gateway resides in.
Regarding your second question: from the VPC's point of view, adding the prefix lists of the S3 and DynamoDB endpoints to the route table will forward any requests that hit these API endpoints through the private route.
At this time DynamoDB does not have the ability to prevent publicly routed interaction; however, S3 does. By adding a VPCE condition to its bucket policy, you can deny any access that does not come through the listed VPC endpoint, as sketched below. Be careful not to block your own access from the console, however; do this by denying only the specific verbs that you don't want allowed.
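A sketch of that bucket policy idea using boto3 (the bucket name and endpoint ID are hypothetical; adjust the action list to the verbs you actually want to restrict):
import json
import boto3
policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Sid': 'DenyAccessOutsideVpce',
        'Effect': 'Deny',
        'Principal': '*',
        'Action': ['s3:GetObject', 's3:PutObject'],  # only the verbs you want restricted
        'Resource': 'arn:aws:s3:::my-example-bucket/*',
        'Condition': {'StringNotEquals': {'aws:sourceVpce': 'vpce-0123456789abcdef0'}},
    }],
}
boto3.client('s3').put_bucket_policy(Bucket='my-example-bucket', Policy=json.dumps(policy))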

How to use AWS VPC endpoints in NAT enabled subnets?

I am using a few AWS Lambda functions, which sit inside private subnets.
These private subnets have VPC endpoints configured for the services the functions need access to.
The current setup does not use a NAT gateway, therefore all the traffic from the functions goes through the VPC endpoints.
I now have a use case where we need to use a NAT gateway.
But would enabling NAT mean that the functions would no longer use the VPC endpoints for external service access, and instead use the NAT?
I think this works as follows. For:
Gateway endpoints (S3, DynamoDB)
Routes to them are added automatically to your route tables when you create them. The docs say:
If you have an existing route in your route table for all internet traffic (0.0.0.0/0) that points to an internet gateway, the endpoint route takes precedence for all traffic destined for the service, because the IP address range for the service is more specific than 0.0.0.0/0. All other internet traffic goes to your internet gateway, including traffic that's destined for the service in other Regions.
Interface VPC Endpoints
They work by modifying DNS resolution for the service: the service's default hostname resolves to the private IP addresses of the endpoint network interfaces. The docs say:
The hosted zone contains a record set for the default DNS name for the service (for example, ec2.us-east-1.amazonaws.com) that resolves to the private IP addresses of the endpoint network interfaces in your VPC. This enables you to make requests to the service using its default DNS hostname instead of the endpoint-specific DNS hostnames.
To use private DNS, you must set the following VPC attributes to true: enableDnsHostnames and enableDnsSupport.
Conclusion
So in both cases, priority is given to the endpoints, not the internet. I recommend checking the links provided; they have more info, with examples, to double-check my conclusions.
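As a quick sanity check (a boto3 sketch with placeholder IDs and region), you can list a route table and confirm that the gateway endpoint's prefix-list route sits alongside the 0.0.0.0/0 route to the NAT gateway:
import boto3
ec2 = boto3.client('ec2', region_name='us-east-1')
route_table = ec2.describe_route_tables(RouteTableIds=['rtb-0123456789abcdef0'])['RouteTables'][0]
for route in route_table['Routes']:
    # Gateway endpoint routes carry a DestinationPrefixListId instead of a CIDR block.
    destination = route.get('DestinationPrefixListId') or route.get('DestinationCidrBlock')
    target = route.get('GatewayId') or route.get('NatGatewayId')
    print(destination, '->', target)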
VPC Endpoints or NAT Gateway?
AWS services like EC2, RDS, Lambda, and ElastiCache come with an Elastic Network Interface (ENI), which enables communication from within your VPCs via Private Endpoints. However, many AWS services provide a REST API, available via the Internet only. A few examples: S3, DynamoDB, CloudWatch, SQS, and Kinesis.
There are three options to make these services accessible from private subnets:
A VPC Endpoint of type Gateway Endpoint is free of charge, but is only available for S3 and DynamoDB.
A VPC Endpoint of type Interface Endpoint costs $7.20 per month per AZ plus $0.01 per GB and is available for most AWS services.
A NAT Gateway can be used to access AWS services or any other services with a public API. It costs $32.40 per month per AZ plus $0.045 per GB.
Keep the following rules of thumb in mind when designing your network architecture.
Adding Gateway Endpoints for S3 and DynamoDB should be your default option.
If you need to access non-AWS resources via the Internet, add a NAT Gateway. Do the math to see whether traffic to AWS services justifies additional Interface Endpoints.
Are you only accessing AWS services from the private subnets? No more than four different services? Use Interface Endpoints. Otherwise, do the math to compare the costs of Interface Endpoints and a NAT Gateway (see the sketch below).
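For the "do the math" step, a rough sketch using the prices quoted above (per AZ and month; prices may have changed since this was written):
def interface_endpoints_cost(num_endpoints, gb_per_month):
    return num_endpoints * 7.20 + 0.01 * gb_per_month
def nat_gateway_cost(gb_per_month):
    return 32.40 + 0.045 * gb_per_month
# Example: four interface endpoints moving 100 GB/month vs. a single NAT gateway.
print(interface_endpoints_cost(4, 100))  # ≈ 29.80 USD
print(nat_gateway_cost(100))             # ≈ 36.90 USD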
Ref Link: https://cloudonaut.io/advanved-aws-networking-pitfalls-that-you-should-avoid/