ECS Fargate Task in EventBridge fails with ResourceInitializationError

I have created an ECS Fargate task, which I can run manually. It updates a DynamoDB table and I get logs.
Now I want this to run on a schedule, so I have set up a scheduled ECS task through EventBridge. However, it does not run.
Looking at the EventBridge logs I can see that the container was stopped with the following stopped reason:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource
retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3
time(s): RequestError: send request failed caused by: Post https://api.ecr....
I thought this might be a problem with permissions. However, I tried giving the Task Execution Role full PowerUser permissions and I still get the same error. Could the problem be something else?

This is due to a connectivity issue.
The docs say the following:
For tasks on Fargate, in order for the task to pull the container image it must either use a public subnet and be assigned a public IP address or a private subnet that has a route to the internet or a NAT gateway that can route requests to the internet.
So you need to make sure your task either has a route to an internet gateway (i.e. it's in a public subnet and has a public IP assigned) or a route through a NAT gateway.
Alternatively, if your service is in an isolated subnet, you need to create VPC endpoints for ECR and other services you need to call, as described in the docs:
To allow your tasks to pull private images from Amazon ECR, you must create the interface VPC endpoints for Amazon ECR.
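If you go the endpoint route, a minimal CLI sketch might look like the following; the region and all IDs are placeholders for your own values. Both the ecr.api and ecr.dkr endpoints are needed, plus CloudWatch Logs if the task uses the awslogs log driver:
# Create the interface endpoints ECR needs (plus CloudWatch Logs for awslogs).
for svc in ecr.api ecr.dkr logs; do
  aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type Interface \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.eu-central-1.$svc \
    --subnet-ids subnet-aaaa1111 \
    --security-group-ids sg-bbbb2222 \
    --private-dns-enabled
done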
When you create a scheduled task, you also specify the networking options. The docs mention this step:
(Optional) Expand Configure network configuration to specify a network configuration. This is required for tasks hosted on Fargate and for tasks using the awsvpc network mode.
For Subnets, specify one or more subnet IDs.
For Security groups, specify one or more security group IDs.
For Auto-assign public IP, specify whether to assign a public IP address from your subnet to the task.
So the networking configuration likely differs between your manually run task and the scheduled task. Refer to the above to figure out the settings needed for your case.
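For reference, here is a rough sketch of what the scheduled task's target looks like when configured through the AWS CLI instead of the console; the rule name, ARNs, subnet and security group are placeholders for your own values:
# Attach the Fargate task to the schedule rule with an explicit network configuration.
aws events put-targets --rule my-scheduled-task --targets '[{
  "Id": "ecs-scheduled-task",
  "Arn": "arn:aws:ecs:eu-central-1:123456789012:cluster/my-cluster",
  "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
  "EcsParameters": {
    "TaskDefinitionArn": "arn:aws:ecs:eu-central-1:123456789012:task-definition/my-task",
    "TaskCount": 1,
    "LaunchType": "FARGATE",
    "NetworkConfiguration": {
      "awsvpcConfiguration": {
        "Subnets": ["subnet-aaaa1111"],
        "SecurityGroups": ["sg-bbbb2222"],
        "AssignPublicIp": "ENABLED"
      }
    }
  }
}]'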

I fixed this by enabling auto-assign public IP.
However, to do this I first had to switch from "Capacity provider strategy" ("Use cluster default") to "Launch type" ("FARGATE"). Only then did the option to enable auto-assign public IP become available in the dropdown in the EventBridge UI.
This seems odd to me, because my default capacity provider strategy for my cluster is Fargate. But it is working now.

Traffic from ECS to ECR needs to go through a gateway. That can be either an Internet Gateway or a NAT Gateway, which eventually affects the cost factor.
But we can also resolve this scenario by creating VPC endpoints, which keeps the traffic within AWS resources.
The endpoints required for this would be the following (a CLI sketch for the S3 gateway piece follows the list):
S3 Gateway
ECR (both ecr.api and ecr.dkr)
ECS
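Note that the S3 endpoint is a Gateway type attached to a route table, not an Interface type; it is needed because ECR stores image layers in S3. A sketch, with placeholder IDs:
# Gateway endpoint for S3, attached to the subnet's route table.
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Gateway \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.eu-central-1.s3 \
  --route-table-ids rtb-cccc3333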

Related

Calling AWS Services from a container on ECS + EC2. Connection Timeout

I want to run an ECS Task on EC2 instance, and I want that task/container to be able to call other AWS services via Boto3.
When I run the same task on Fargate, it works as expected and I am able to call other AWS services from the task/container. When I run the ECS task on EC2, it gives me connection timeout errors when attempting to call other AWS services. (The specific errors depend on the service.)
In an attempt to rule out any permission issues, I am running in a public subnet and using a single IAM role (with the AdministratorAccess policy) for the EC2 instance, ECS task role, and ECS task execution role.
The ECS Task on EC2 IS able to access the internet (which I confirmed by having it ping google.com).
What are any other conditions that need to be satisfied in order to call other AWS services from a container on ECS + EC2?
The cause of my issue was using a public subnet and the awsvpc network mode.
Using Amazon EC2 — You can launch EC2 instances on a public subnet. Amazon ECS uses these EC2 instances as cluster capacity, and any containers that are running on the instances can use the underlying public IP address of the host for outbound networking. This applies to both the host and bridge network modes. However, the awsvpc network mode doesn't provide task ENIs with public IP addresses. Therefore, they can't make direct use of an internet gateway.
-- Amazon Elastic Container Service Best Practices Guide
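If you want to keep the awsvpc network mode, one way out (assuming your tasks can move to a private subnet) is to give that subnet's route table a default route through a NAT gateway; the IDs below are placeholders:
# Send internet-bound traffic from the private subnet through a NAT gateway,
# so awsvpc task ENIs without public IPs can still reach AWS APIs.
aws ec2 create-route \
  --route-table-id rtb-private1 \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-0123456789abcdef0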

How can I access AWS resources in a VPC from AWS Glue?

I have a Glue job which hits an API hosted on an EC2 instance.
The problem is that the EC2 instance resides within a VPC that blocks all public access.
I tried creating an interface endpoint in my VPC but still can't access the REST API.
The host is always unreachable, but when I try to access the API from within the VPC it works fine.
The security group associated with the EC2 instance is used while creating the VPC Endpoint.
Any help is appreciated
If you go to the AWS Glue console, under Connections, create a connection. What is meant by a "dummy" connection is just a non-existent database or resource, for example: jdbc:mysql://some-fake-endpoint-here:3306/mydb. After this you choose the correct VPC, subnet and security group. This means a test connection will not work in this context, but what it brings is a way to introduce your VPC, subnet and security group information to the job. Testing such a connection can be done using a python-shell job, or by launching an EC2 instance in the same VPC or subnet and running something like nc -vz endpoint port.
This connection metadata will facilitate the launching of elastic network interfaces in your account that allow Glue DPUs to communicate with your resource at runtime. More on how connections work in Glue is discussed here.
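The same dummy connection can be sketched with the AWS CLI; everything except the fake JDBC URL pattern is a placeholder, and the endpoint is never actually contacted:
# Create a "dummy" JDBC connection purely to carry VPC/subnet/security group info.
aws glue create-connection --connection-input '{
  "Name": "vpc-dummy-connection",
  "ConnectionType": "JDBC",
  "ConnectionProperties": {
    "JDBC_CONNECTION_URL": "jdbc:mysql://some-fake-endpoint-here:3306/mydb",
    "USERNAME": "dummy",
    "PASSWORD": "dummy"
  },
  "PhysicalConnectionRequirements": {
    "SubnetId": "subnet-aaaa1111",
    "SecurityGroupIdList": ["sg-bbbb2222"],
    "AvailabilityZone": "eu-central-1a"
  }
}'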

AWS ECS Fargate Platform 1.4 error ResourceInitializationError: unable to pull secrets or registry auth: execution resource

I have been using Docker containers with secrets on ECS without problems. After moving to Fargate and platform version 1.4 for EFS support, I started getting the following error.
Any help please?
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 1 time(s): secret arn:aws:secretsmanager:eu-central-1:.....
Here's a checklist:
If your ECS tasks are in a public subnet (0.0.0.0/0 routes to Internet Gateway) make sure your tasks can call the "public" endpoint for Secrets Manager. Basically, outbound TCP/443.
If your ECS tasks are in a private subnet, make sure that one of the following is true: (a) your instances need to connect to the Internet through a NAT gateway (0.0.0.0/0 routes to NAT gateway) or (b) you have an AWS PrivateLink endpoint to secrets manager connected to your VPC (and to your subnets)
If you have an AWS PrivateLink connection, make sure the associated Security Group has inbound access from the security groups linked to your ECS tasks.
Make sure the ECS task execution role has the secretsmanager:GetSecretValue IAM permission for the ARN(s) of the Secrets Manager entry (or entries) referenced in the task definition.
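As a sketch of that last item (role name, policy name, region, account ID and secret name are all placeholders):
# Allow the execution role to read the specific secret(s) the task definition references.
aws iam put-role-policy \
  --role-name myTaskExecutionRole \
  --policy-name read-task-secrets \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:my-secret-*"
    }]
  }'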
Edit: Here's another excellent answer - https://stackoverflow.com/a/66802973
I had the same error message, but the checklist above misses the cause of my problem. If you are using VPC endpoints to access AWS services (i.e. Secrets Manager, ECR, SQS, etc.) then those endpoints MUST permit inbound access from the security group associated with the ECS tasks running in your VPC subnet.
Another gotcha: if you are using EFS to host volumes, ensure that your volumes can be mounted by the same security group identified above. Go to EFS, select the appropriate file system, then the Network tab, then Manage.
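If you'd rather do that from the CLI, something like the following should work (the mount target ID and security group are placeholders; repeat it per mount target):
# Allow the task's security group to mount the EFS file system.
aws efs modify-mount-target-security-groups \
  --mount-target-id fsmt-0123456789abcdef0 \
  --security-groups sg-bbbb2222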

How do I set the AWS peering connection DNS resolution options through CloudFormation?

I have two VPCs:
VPC1 which holds our RDS instance.
VPC2 which holds our cluster of EC2 instances.
We have successfully setup a VPC peering connection, routes and security groups to allow appropriate communication.
In order to resolve the RDS instance's AZ-appropriate local IP address from its hostname, we need to follow these instructions and set --requester-peering-connection-options AllowDnsResolutionFromRemoteVpc=true.
If I do this manually through the AWS Console or the AWS CLI it all works fine, however I'm creating the cluster of EC2 instances through CloudFormation and the option is missing from the CloudFormation documentation.
The effect of this is that my stack starts up and fails because the services themselves cannot connect to the database.
Am I doing something obviously wrong, or is this just Amazon being incomplete?
Thanks!
Due to the frequency of updates, there are many times when an AWS feature isn't yet available in CloudFormation (ALB targeting Lambda used to be one of them); you end up having to create a custom resource to manage it. It's not too bad; just make sure that your Lambda responds with success or failure in all scenarios, including exceptions, otherwise your stack will be 'in progress' for hours.
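Whatever implements the custom resource ultimately just needs to make the same call as the manual CLI step; as a sketch, with a placeholder peering connection ID:
# The call the custom resource has to make on create/update
# (there is a matching --accepter-peering-connection-options for the other side).
aws ec2 modify-vpc-peering-connection-options \
  --vpc-peering-connection-id pcx-0123456789abcdef0 \
  --requester-peering-connection-options AllowDnsResolutionFromRemoteVpc=true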

Why can't my ECS service register available EC2 instances with my ELB?

I've got an EC2 launch configuration that uses the ECS-optimized AMI. I've got an auto scaling group that ensures that I've got at least two available instances at all times. Finally, I've got a load balancer.
I'm trying to create an ECS service that distributes my tasks across the instances in the load balancer.
After reading the documentation for ECS load balancing, it's my understanding that my ASG should not automatically register my EC2 instances with the ELB, because ECS takes care of that. So, my ASG does not specify an ELB. Likewise, my ELB does not have any registered EC2 instances.
When I create my ECS service, I choose the ELB and also select the ecsServiceRole. After creating the service, I never see any instances available in the ECS Instances tab. The service also fails to start any tasks, with a very generic error of ...
service was unable to place a task because the resources could not be found.
I've been at this for about two days now and can't seem to figure out what configuration settings are not properly configured. Does anybody have any ideas as to what might be causing this to not work?
Update # 06/25/2015:
I think this may have something to do with the ECS_CLUSTER user data setting.
In my EC2 auto scaling launch configuration, if I leave the user data input completely empty, the instances are created with an ECS_CLUSTER value of "default". When this happens, I see an automatically-created cluster, named "default". In this default cluster, I see the instances and can register tasks with the ELB like expected. My ELB health check (HTTP) passes once the tasks are registered with the ELB and all is good in the world.
But, if I change that ECS_CLUSTER setting to something custom I never see a cluster created with that name. If I manually create a cluster with that name, the instances never become visible within the cluster. I can't ever register tasks with the ELB in this scenario.
Any ideas?
I had similar symptoms but ended up finding the answer in the log files:
/var/log/ecs/ecs-agent.2016-04-06-03:
2016-04-06T03:05:26Z [ERROR] Error registering: AccessDeniedException: User: arn:aws:sts::<removed>:assumed-role/<removed>/<removed> is not authorized to perform: ecs:RegisterContainerInstance on resource: arn:aws:ecs:us-west-2:<removed>:cluster/MyCluster-PROD
status code: 400, request id: <removed>
In my case, the resource existed but was not accessible. It sounds like OP is pointing at a resource that doesn't exist or isn't visible. Are your clusters and instances in the same region? The logs should confirm the details.
In response to other posts:
You do NOT need public IP addresses.
You do need the ecsServiceRole or an equivalent IAM role assigned to the EC2 instance in order to talk to the ECS service. You must also specify the ECS cluster; this can be done via user data during instance launch or in the launch configuration definition, like so:
#!/bin/bash
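# Tell the ECS agent which cluster to register with (it defaults to "default").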
echo ECS_CLUSTER=GenericSericeECSClusterPROD >> /etc/ecs/ecs.config
If you fail to do this on newly launched instances, you can set it after the instance has launched and then restart the ECS agent.
In the end, it turned out that my EC2 instances were not being assigned public IP addresses. It appears ECS needs to be able to directly communicate with each EC2 instance, which would require each instance to have a public IP. I was not assigning my container instances public IP addresses because I thought I'd have them all behind a public load balancer, and each container instance would be private.
Another problem that might arise is not assigning a role with the proper policy to the Launch Configuration. My role didn't have the AmazonEC2ContainerServiceforEC2Role policy (or the permissions that it contains) as specified here.
You definitely do not need public IP addresses for each of your private instances. The correct (and safest) way to do this is to set up a NAT Gateway and attach that gateway to the route table that is attached to your private subnet.
This is documented in detail in the VPC documentation, specifically Scenario 2: VPC with Public and Private Subnets (NAT).
It might also be that the ECS agent creates a file in /var/lib/ecs/data that stores the cluster name.
If the agent first starts up with the cluster name of 'default', you'll need to delete this file and then restart the agent.
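On the older ECS-optimized Amazon Linux AMI that would look roughly like this (the exact state file name under /var/lib/ecs/data may vary with the agent version):
# Stop the agent, clear its checkpointed state (including the cached cluster name),
# then start it again so it re-registers using the cluster set in ecs.config.
sudo stop ecs
sudo rm /var/lib/ecs/data/ecs_agent_data.json
sudo start ecs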
There were several layers of problems in our case. I will list them out, as it might give you some idea of the issues to pursue.
My goal was to have one ECS container instance on one host. But ECS forces you to have two subnets in your VPC, each with one Docker host instance. I was trying to have just one Docker host in one availability zone and could not get it to work.
The other issue was that only one of the subnets had an internet-facing gateway attached to it, so one of them was not publicly accessible.
The end result was that DNS served two IPs for my ELB, and one of the IPs would work while the other did not. So I was seeing random 404s when accessing the NLB using its public DNS.