I want to set up an EC2 instance running in a private VPC subnet. It can connect to the Internet from the private subnet but cannot be accessed from outside. There is also a Lambda function that triggers the EC2 instance to initiate interactions with external resources (S3, DynamoDB, the Internet).
I have set up the VPC as follows:
An EC2 instance running Docker in a private VPC subnet
An ALB (Application Load Balancer) configured as internal, in the private subnets (same subnet as the EC2 instance)
A NAT Gateway, which is working
A Lambda function that makes HTTPS GET and POST requests to the Internet and to the ALB
A Route 53 private hosted zone with a record set that routes "abcd.internal/api" to the ALB
Here is the problem: the Lambda function can connect to the Internet over HTTPS, but it fails when it makes an HTTPS GET to the ALB using the private hosted zone record ("abcd.internal").
My understanding is that since my ALB, EC2 instance, Lambda, NAT Gateway, and Route 53 zone are all in the same VPC, they should be able to talk to each other using the private DNS name. I don't know why it fails.
Note: before setting up the internal ALB, I tried an internet-facing ALB in a public subnet and configured a public hosted zone record set "abcd.public" for it. That ALB could talk to the EC2 instance, and the EC2 instance could reach the Internet through the NAT Gateway, so the "EC2 to Internet" part is working.
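For anyone debugging a similar setup, a quick first check that separates DNS resolution from connectivity is to test from a host inside the VPC (the EC2 instance, for example). This is only a sketch; "abcd.internal" is the placeholder name from above and the /api path is assumed.

# Run from a host inside the VPC (e.g. the EC2 instance).
# 1. Does the private hosted zone record resolve, and to which IPs?
dig +short abcd.internal
# 2. Can we reach the ALB over HTTPS, and what does the TLS handshake report?
#    -k skips certificate verification so the response itself is still visible.
curl -vk https://abcd.internal/api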
Update:
I finally dug up some error messages in the Lambda log:
Error: Hostname/IP doesn't match certificate's altnames: "Host: abcd.internal. is not in the cert's altnames: DNS:.public"]
reason: 'Host: abcd.internal. is not in the cert\'s altnames: DNS:.public',
host: 'abcd.internal.',
That is interesting. I do have a public hosted zone coexisting with the private hosted zone, but the public zone is for another purpose. I don't know why the Lambda function resolves the public DNS name rather than the private one, since it is configured inside a private subnet.
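That altname error suggests the request did reach an HTTPS listener, but the certificate that listener presented only covers the public domain, not the private name. A quick way to confirm which certificate the internal ALB is actually serving (a sketch; the hostname is the placeholder from the question):

# Run from inside the VPC. Prints the SANs of the certificate the ALB presents
# when asked for the internal hostname; if only the public domain shows up,
# the HTTPS listener is serving the public-zone certificate.
openssl s_client -connect abcd.internal:443 -servername abcd.internal </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"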
Thanks to everyone who posted comments and gave suggestions.
To solve this problem, I tried almost every possible solution I could find online. Everything is in the right place: the Lambda function, ALB, and EC2 instance are in the same VPC private subnet, and Route 53, the NAT Gateway, and the IGW are properly set up. I also tried playing with the DHCP options set, which didn't work; maybe I don't fully understand DHCP option sets, and I couldn't find an example.
It turns out that HTTPS is what is not working. Before moving to the private VPC, I had the same setup in a public VPC, and the resources communicated over HTTPS; for example, the Lambda function would GET/POST to the EC2 instance or the ALB. After moving everything into the private VPC, HTTPS requests cannot use the internal DNS names.
However, if I use plain HTTP, the resources can finally find each other by their internal DNS names.
I still don't know why HTTPS can't be used in the private VPC, but I can live with this workaround.
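For what it's worth, the altname error above points at the certificate on the HTTPS listener rather than at DNS: the listener is presenting a certificate issued for the public domain only. One possible way to keep HTTPS, assuming you can issue (or self-sign via a private CA) a certificate that covers abcd.internal, is to import it into ACM and attach it to the internal ALB's HTTPS listener. All file names and ARNs below are placeholders, and the Lambda would also need to trust the issuing CA, so treat this as a sketch of the direction rather than a drop-in fix.

# Import a certificate whose SANs include abcd.internal (placeholder files).
aws acm import-certificate \
    --certificate fileb://abcd-internal-cert.pem \
    --private-key fileb://abcd-internal-key.pem
# Attach the imported certificate to the internal ALB's HTTPS listener
# (placeholder ARNs; the ACM ARN comes from the import-certificate output).
aws elbv2 add-listener-certificates \
    --listener-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/EXAMPLE \
    --certificates CertificateArn=arn:aws:acm:REGION:ACCOUNT:certificate/EXAMPLE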
I had the same problem.
The ALB was not added as a trigger for the Lambda, which was causing a similar certificate issue for me.
The security group was configured incorrectly in my case.
I noticed that the role assigned to the Lambda should include a policy with create/delete ENI permissions.
Sometimes the ALB updates were not quick, so I recreated it with the same settings and it started to work.
Did you make sure to check whether the IAM role attached to your Lambda has access to the EC2 network-related actions? Here's an example IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DeleteNetworkInterface",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        }
    ]
}
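If you'd rather not hand-craft that statement, the core ENI actions it grants are also covered by the AWS-managed AWSLambdaVPCAccessExecutionRole policy; attaching it is a one-liner (the role name is a placeholder for your function's role):

# Attach the managed policy that grants the ENI permissions a VPC-attached
# Lambda needs; "my-lambda-role" is a placeholder.
aws iam attach-role-policy \
    --role-name my-lambda-role \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole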
Related
I use aws ecr to get a login password and then pull a Docker image from a private ECR registry on the public-subnet EC2 instance. The public subnet already has an internet gateway attached.
I already had a gateway endpoint for S3, so I created an interface endpoint for ECR (com.amazonaws.ap-southeast-1.ecr.dkr) following the official document; its subnet setting is the private subnet, and private DNS is enabled.
https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html#test-interface-endpoint-aws
After that, the public EC2 instance can get the password via aws ecr, but docker login fails; the private EC2 instance cannot get the password via aws ecr at all.
Both EC2 instances allow all outbound traffic and have no custom NACL settings. Their IAM role combines AmazonEC2ContainerRegistryReadOnly with the S3 access permission shown below.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::prod-ap-southeast-1-starport-layer-bucket/*"
        }
    ]
}
On the private EC2 instance, aws ecr get-login-password --region ap-southeast-1 fails with:
Connect timeout on endpoint URL: "https://api.ecr.ap-southeast-1.amazonaws.com/"
Using dig to resolve api.ecr.ap-southeast-1.amazonaws.com succeeds and returns an IP. I did not change any settings after creating the interface endpoint. I don't know which step is wrong; please give me some suggestions. Thank you very much.
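One way to narrow this down from the private instance is to check whether the ECR API name resolves to an address inside the VPC and whether TCP 443 to it is actually reachable. This is only a diagnostic sketch; it uses the region from the question and plain bash.

# Run from the private EC2 instance.
# If an interface endpoint with private DNS covers this name, it should return
# a private IP from the VPC's CIDR; a public IP means the call has to leave the VPC.
dig +short api.ecr.ap-southeast-1.amazonaws.com
# A connect timeout here (rather than a DNS failure) points at a security group
# or NACL not allowing TCP 443, or at a route that leads nowhere.
timeout 5 bash -c 'exec 3<>/dev/tcp/api.ecr.ap-southeast-1.amazonaws.com/443' \
  && echo "443 reachable" || echo "443 blocked or timed out"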
Update
I have a VPC with one public subnet and one private subnet, each with its own route table: the public subnet's route table has a route to the internet gateway, and the private subnet's route table has the S3 endpoint.
Security groups
Private subnet EC2
Inbound rules: source: sg-ALB, HTTP 80; source: sg-public-EC2, SSH 22
Outbound rule: All traffic
Public subnet EC2
Inbound rules: source: all IPv4, SSH 22
Outbound rule: All traffic
Public EC2 error message:
Error response from daemon: Get "https://xxx.dkr.ecr.ap-southeast-1.amazonaws.com/v2/":
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Update 2
I have a Route 53 private hosted zone in this VPC; I'm not sure whether that could be a problem.
Your endpoint is using HTTPS, which means you have to allow port 443 in your security groups.
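Concretely, the security group attached to the interface endpoint's ENIs needs an inbound rule allowing TCP 443 from the instances (or their subnet/VPC CIDR). A sketch with placeholder IDs:

# Allow HTTPS from the VPC CIDR to the endpoint's security group.
# sg-0123456789abcdef0 and 10.0.0.0/16 are placeholders for the endpoint SG and VPC CIDR.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 443 \
    --cidr 10.0.0.0/16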
I've been trying to connect to an S3 bucket from a Lambda residing in a private subnet. I did the exact same thing for an EC2 instance and it worked like a charm; I'm not sure why it's such an issue with Lambda. My Lambda times out after the configured interval.
Here's my Lambda's VPC configuration:
Here's the security group outbound configuration:
Below are the outbound rules of the subnet associated with the Lambda:
As you can see, I created a VPC endpoint to route my traffic through the VPC, but it doesn't work. I'm not sure what I'm missing here. Below is the VPC endpoint configuration.
I've given full access to S3 in the policy, like this:
{
    "Statement": [
        {
            "Action": "*",
            "Effect": "Allow",
            "Resource": "*",
            "Principal": "*"
        }
    ]
}
When I run my Lambda code, I get a timeout error as below:
You can access Amazon S3 objects through a VPC endpoint only when the S3 objects are in the same Region as the Amazon S3 gateway VPC endpoint. Confirm that your objects and your endpoint are in the same Region.
To reproduce your situation, I performed the following steps:
Created an AWS Lambda function that calls ListBuckets(). Tested it without attaching to a VPC. It worked fine.
Created a VPC with just a private subnet
Added an Amazon S3 Endpoint Gateway to the VPC and subnet
Reconfigured the Lambda function to use the VPC and subnet
Tested the Lambda function -- it worked fine
I suspect your problem might lie with the security group attached to the Lambda function. I left my outbound rules as "All Traffic 0.0.0.0/0" rather than restricting them. Give that a try and see if it makes things better.
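If the function's security group has had its default egress rule removed, re-adding an allow-all outbound rule looks like this (the group ID is a placeholder for the Lambda's SG):

# Re-open all outbound traffic on the Lambda's security group (placeholder ID).
aws ec2 authorize-security-group-egress \
    --group-id sg-0123456789abcdef0 \
    --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'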
Slightly tearing my hair out with this one... I am trying to run a Docker image on Fargate in a public subnet of a VPC. When I run this as a Task I get:
ResourceInitializationError: unable to pull secrets or registry auth: pull
command failed: : signal: killed
If I run the Task in a Private subnet, through a NAT, it works. It also works if I run it in a Public subnet of the default VPC.
I have checked through the advice here:
Aws ecs fargate ResourceInitializationError: unable to pull secrets or registry auth
In particular, I have security groups set up to allow all traffic, and the network ACL set up to allow all traffic. I have even been quite liberal with the IAM permissions in order to try to eliminate that as a possibility:
The task execution role has:
{
    "Action": [
        "kms:*",
        "secretsmanager:*",
        "ssm:*",
        "s3:*",
        "ecr:*",
        "ecs:*",
        "ec2:*"
    ],
    "Resource": "*",
    "Effect": "Allow"
}
With a trust relationship to allow ecs-tasks to assume this role:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ecs-tasks.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
The security group is:
sg-093e79ca793d923ab All traffic All traffic All 0.0.0.0/0
And the Network ACL is:
Inbound
Rule number Type Protocol Port range Source Allow/Deny
100 All traffic All All 0.0.0.0/0 Allow
* All traffic All All 0.0.0.0/0 Deny
Outbound
Rule number Type Protocol Port range Destination Allow/Deny
100 All traffic All All 0.0.0.0/0 Allow
* All traffic All All 0.0.0.0/0 Deny
I set up flow logs on the subnet, and I can see that traffic is accepted (ACCEPT OK) in both directions.
I do not have any interface endpoints set up to reach AWS services without going through the internet gateway.
I also have a public IP address assigned to the Fargate task upon creation.
This should work, since the Public subnet should have access to all needed services through the Internet Gateway. It also works in the default VPC or a Private subnet.
Can anyone suggest what else I should check to debug this?
One of the potential causes of ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed is a disabled "Auto-assign public IP" setting. After I enabled it (recreating the service from scratch), the task ran properly without issues.
For those unlucky souls, there is one more thing to check.
I already had an internet gateway in my VPC, DNS was enabled for that VPC, all containers were getting public IPs and the execution role already had access to ECR. But even so, I was still getting the same error.
It turns out the problem was the route table. The route table of my VPC didn't include a route directing outbound traffic to the internet gateway, so my subnet had no internet access.
Adding a second entry to the route table that routes 0.0.0.0/0 traffic to the internet gateway solved the issue.
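For reference, adding that default route from the CLI looks roughly like this (the route table and internet gateway IDs are placeholders):

# Send all non-local traffic from the subnet's route table to the internet gateway.
aws ec2 create-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 0.0.0.0/0 \
    --gateway-id igw-0123456789abcdef0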
Edited answer based on feedback from #nathan and #howard-swope.
Checklist:
The VPC has "DNS hostnames" and "DNS resolution" enabled
The "task execution role" has access to ECR, e.g. it has the AmazonECSTaskExecutionRolePolicy policy attached
If the task runs in a PUBLIC subnet:
The subnets have Internet access, i.e. an internet gateway route is assigned to the subnets
Enable "assign public IP" when creating the task
If the task runs in a PRIVATE subnet:
The subnets have Internet access, i.e. a NAT gateway route is assigned to the subnets
... and the NAT gateway resides in a public subnet
I was facing the same issue, but in my case I was triggering the Fargate container from a Lambda function using the RunTask operation, and in the RunTask call I was not passing the parameter below:
assignPublicIp: ENABLED
After adding this, the container started without any issues.
It turns out that I did not have DNS support enabled for the VPC. Once this is enabled, it works.
I did not see DNS support explicitly mentioned in any docs for Fargate. I guess it's fairly obvious, since how else would it look up the various AWS services it needs, but I thought it worth noting in an answer against this error message.
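Enabling both VPC DNS attributes from the CLI is quick if anyone needs it (the VPC ID is a placeholder); the two attributes have to be set in separate calls:

# Enable DNS resolution and DNS hostnames for the VPC (placeholder VPC ID).
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-hostnames "{\"Value\":true}"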
For AWS Batch on Fargate, this error was triggered by the 'Assign public IP' setting being disabled.
This setting is configurable during the job definition step; however, it is not configurable in the UI after the job definition has been created.
The AWS container runner needs access to the container repositories and to AWS services.
If you're in a public subnet, the easiest option is to enable "Auto-assign public IP" so your containers have Internet access, even if your app itself does not need Internet access.
Otherwise, if you're using only AWS services (ECR, and no images pulled from docker.io), you can use VPC endpoints to reach ECR/S3/CloudWatch and enable the DNS options on your VPC.
For a private subnet, it's the same.
If you're using docker.io images, then you need egress Internet access from your subnet anyway.
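A sketch of the endpoints that the no-NAT path needs, with placeholder region and IDs; ECR pulls also need the S3 gateway endpoint because image layers are served from S3, and tasks using the awslogs driver additionally need a com.amazonaws.REGION.logs interface endpoint:

# Interface endpoints for the ECR API and the Docker registry (private DNS enabled).
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.ecr.api \
    --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.ecr.dkr \
    --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
# Gateway endpoint for S3, attached to the private subnet's route table.
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0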
In my case of dealing with the above error, while running the run-task command (yes, not via the service route), I was not specifying the security group in aws ecs run-task --network-configuration. This resulted in the default SG of the task's VPC being picked up, and my default SG in that VPC had no inbound/outbound rules defined. I added ONLY an outbound rule allowing all traffic to everywhere, and the error went away.
My setup is an ECS/Fargate task running in a private subnet with ECR connectivity via VPC interface endpoints. I had the checklist mentioned above checked and, in addition, added the SG rule.
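For reference, explicitly passing both the security groups and the public-IP setting to run-task looks like this; all IDs are placeholders, and assignPublicIp would be ENABLED instead when running in a public subnet without a NAT:

# Run a one-off Fargate task with an explicit network configuration so the
# VPC's default security group is not silently picked up (placeholder IDs).
aws ecs run-task \
    --cluster my-cluster \
    --launch-type FARGATE \
    --task-definition my-task:1 \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=DISABLED}'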
I recently turned my two AWS public subnets into private subnets and added a public subnet with a NAT gateway. The private subnets' route table routes traffic to the NAT gateway, and the public one routes it to the internet gateway. However, it isn't working and I don't get responses to my API calls.
I think this is because my VPC endpoint has the two private subnets associated with it instead of the public subnet. I tried to change the associated subnets to the public one but got this AWS error:
Error modifying subnets
Can't change subnets of a requester-managed endpoint for the service ...
What would be the way to get around this error and add my public subnet to the VPC endpoint?
Additional info: each private subnet has an EC2 Auto Scaling group instance and an Aurora Serverless DB instance in it.
Cheers, Kris
I also had this annoying problem. The error messages are not really helpful here; they do not reveal which service created those interfaces. So I went to CloudTrail, listed all events, and searched for the VPC endpoint ID (vpce-1234567890xxx) that refused to be deleted, to find out who created it. In my case, it turned out to be the RDS Proxy service, so I went to RDS and deleted the proxy.
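If clicking through the console is tedious, the same lookup works from the CLI; the endpoint ID below is the placeholder used above:

# List the CloudTrail events that reference the stubborn endpoint to see which
# service (and which principal) created it.
aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=ResourceName,AttributeValue=vpce-1234567890xxx \
    --query 'Events[].{Time:EventTime,Name:EventName,User:Username}'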
Since it is a requester-managed VPC endpoint:
You cannot modify or detach a requester-managed network interface.
This means that you have to delete the resource that created the endpoint in the first place:
If you delete the resource that the network interface represents, the AWS service detaches and deletes the network interface for you.
An AWS Lambda configured to go through a NAT gateway with an EIP is not presenting that IP and keeps showing a random one.
I've created a Lambda sitting in a private subnet, which routes all its traffic to another (public) subnet that has a NAT Gateway with an attached EIP.
The public subnet routes all its traffic to an IGW attached to the same VPC.
All the Lambda does is send off an HTTP request and receive a response.
I've checked whether traffic is going through the IGW and the NAT, and it looks like it does, because when I remove either one from the respective route table, the Lambda times out.
I can also see activity in the NAT's monitoring tab when I run the Lambda.
I'm sending this as the options of a Node.js HTTP request in the Lambda:
{
    "options": {
        "hostname": "www.whatsmyip.org",
        "method": "GET",
        "port": 443
    }
}
When I try to 'prod' www.whatsmyip.org to see the IP address of the Lambda, I keep getting different random results instead of getting back the EIP attached to the NAT Gateway.
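A sanity check that takes the website out of the equation is to compare the routes the Lambda's subnet actually uses with the EIP the NAT gateway holds; the subnet and NAT gateway IDs below are placeholders:

# Which route table does the Lambda's subnet use, and where does 0.0.0.0/0 go?
aws ec2 describe-route-tables \
    --filters Name=association.subnet-id,Values=subnet-0123456789abcdef0 \
    --query 'RouteTables[].Routes'
# Which public IP (EIP) does the NAT gateway actually hold?
aws ec2 describe-nat-gateways \
    --nat-gateway-ids nat-0123456789abcdef0 \
    --query 'NatGateways[].NatGatewayAddresses[].PublicIp'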
Here's a layout of the setup:
Why do you need an IGW when you don't accept incoming connections from the Internet? Your setup seems okay. Remove the NAT GW route to the IGW and remove the IGW (not needed). It should work.