Amazon ECS "the referenced cluster was inactive" - amazon-web-services

I followed the steps to install the ECS container agent on Ubuntu 16, but when I try to run the agent it keeps restarting, and when I look at the logs I see:
2016-12-07T06:01:39Z [INFO] Starting Agent: Amazon ECS Agent - v1.13.1 (efe53c6)
2016-12-07T06:01:39Z [INFO] Loading configuration
2016-12-07T06:01:39Z [INFO] Checkpointing is enabled. Attempting to load state
2016-12-07T06:01:39Z [INFO] Loading state! module="statemanager"
2016-12-07T06:01:39Z [INFO] Event stream ContainerChange start listening...
2016-12-07T06:01:39Z [INFO] Detected Docker versions [1.17 1.18 1.19 1.20 1.21 1.22 1.23]
2016-12-07T06:01:39Z [INFO] Registering Instance with ECS
2016-12-07T06:01:39Z [ERROR] Could not register module="api client" err="ClientException: The referenced cluster was inactive.
status code: 400, request id: 9eaa4124-bc42-11e6-9cf1-7559dea2bdf8"
2016-12-07T06:01:39Z [ERROR] Error registering: ClientException: The referenced cluster was inactive.
status code: 400, request id: 9eaa4124-bc42-11e6-9cf1-7559dea2bdf8
I didn't find a reference for this error on Google and I'm wondering what's wrong...
Do I need to create the cluster name on the ECS dashboard?
I have attached the container role to my EC2 instance, which allows for cluster creation, so I don't think the problem comes from there...
My docker run config:
sudo docker run --name ecs-agent \
--detach=true \
--restart=on-failure:10 \
--volume=/var/run/docker.sock:/var/run/docker.sock \
--volume=/var/log/ecs/:/log \
--volume=/var/lib/ecs/data:/data \
--net=host \
--env=ECS_LOGFILE=/var/log/ecs-agent.log \
--env=ECS_LOGLEVEL=info \
--env=ECS_DATADIR=/data \
--env=ECS_CLUSTER=my-cluster \
--env=ECS_ENABLE_TASK_IAM_ROLE=true \
--env=ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true \
amazon/amazon-ecs-agent:latest

You need to call aws ecs create-cluster --region $REGION --cluster-name my-cluster, call the CreateCluster API through an SDK, or create the cluster in the console. The ECS agent will only automatically create a cluster named default, and only when ECS_CLUSTER is unspecified.
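For example, a minimal sketch using the names from the question (my-cluster comes from the ECS_CLUSTER value; set $REGION to the instance's region):
# Sketch: create the cluster the agent points at, then restart the agent container so it re-registers.
aws ecs create-cluster --cluster-name my-cluster --region $REGION
sudo docker restart ecs-agent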

Related

Can I add a wait when using aws ecs run-task command in my deployment pipeline

I am using CircleCI for my CI/CD along with CodeDeploy. I would like to run an ecs run-task command and have the task complete before moving on to the more intricate deployment stages, which we handle with CodeDeploy and trigger through the CircleCI config. In a previous version of the AWS CLI the --wait flag was an option for this, but it is not an option in AWS CLI version 2+. Are there any other simple alternatives that people are using to get around this?
Adding my solution here, thanks to Mark B's response below.
TASK_ID=$(aws ecs run-task \
--profile staging \
--cluster <cluster-name> \
--task-definition <task-definition> \
--count 1 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[$SUBNET_ONE_STAGING, $SUBNET_TWO_STAGING, $SUBNET_THREE_STAGING],securityGroups=[$SECURITY_GROUP_IDS_STAGING],assignPublicIp=ENABLED}" \
--overrides '{"containerOverrides":[{"name": "my-app","command": ["/bin/sh", "-c", "bundle exec rake db:migrate && bundle exec rake after_party:run"]}]}' \
| jq -r '.tasks[0].taskArn') \
&& aws ecs wait tasks-stopped --cluster <cluster-name> --tasks ${TASK_ID}
You would use the aws ecs wait capability in the CLI. Note that this is the same in version 1 of the CLI and version 2; there was never a --wait flag for ECS tasks in the core AWS CLI as far as I'm aware.
Specifically, after starting the task and getting the task ID returned from the run-task command, you would use aws ecs wait tasks-stopped --cluster <cluster-name> --tasks <task-id> to wait for the task to be done/stopped.
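If the pipeline should also fail when the migration container exits non-zero, a possible follow-up to the script above (not part of the original answer; the container name my-app matches the override in the run-task call):
# Sketch: read the stopped task's container exit code and propagate it to the pipeline.
EXIT_CODE=$(aws ecs describe-tasks \
--cluster <cluster-name> \
--tasks ${TASK_ID} \
--query "tasks[0].containers[?name=='my-app'].exitCode" \
--output text)
[ "${EXIT_CODE}" = "0" ] || exit 1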

AWS EMR - Terminated with errors On the master instance application provisioning failed

I'm provisioning an EMR cluster with emr-5.30.0. I run this using Terraform and get the following error in the AWS console as it fails.
Amazon EMR Cluster j-11I5FOBxxxxxx has terminated with errors at 2020-10-26 19:51 UTC with a reason of BOOTSTRAP_FAILURE.
I don't have any bootstrap steps. I can't view any logs either to see what is happening: the Log URI is blank, and I can't SSH to the cluster either since it's terminated.
Any pointers would be appreciated.
Providing the AWS CLI export output:
aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --applications Name=Spark --tags 'Account=xxx' 'Function=xxx' 'Repository=' 'Mail=xxx#xxx.com' 'Slack=xxx' 'Builder=xxx' 'Environment=xxx' 'Service=xxx xxx xxx' 'Team=xxx' 'Name=xxx-xxx-xxx' --ebs-root-volume-size 100 --ec2-attributes '{"KeyName":"xxx","AdditionalSlaveSecurityGroups":[""],"InstanceProfile":"EMR_EC2_DefaultRole","ServiceAccessSecurityGroup":"sg-xxx","SubnetId":"subnet-xxx","EmrManagedSlaveSecurityGroup":"sg-xxx","EmrManagedMasterSecurityGroup":"sg-xxx","AdditionalMasterSecurityGroups":[""]}' --service-role EMR_DefaultRole --release-label emr-5.30.0 --name 'xxx-xxx-xxx' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":""},{"InstanceCount":2,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":40,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"m5.2xlarge","Name":""}]' --configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3","JAVA_HOME":"/usr/lib/jvm/java-1.8.0"}}]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3","JAVA_HOME":"/usr/lib/jvm/java-1.8.0"}}]}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region eu-west-2
The issue was due to JAVA_HOME being set incorrectly:
JAVA_HOME":"/usr/lib/jvm/java-1.8.0"
Resolution: check the logs in S3 under provision-node/reports and they should tell you which bootstrap step failed...
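If a LogUri is configured (the asker's was blank, so it would have to be set when creating the cluster), a hedged sketch of where to look, with a placeholder bucket name:
# Sketch only: the bucket and prefix are placeholders for whatever LogUri you configured.
aws s3 ls s3://my-emr-logs/j-11I5FOBxxxxxx/ --recursive | grep provision-node/reports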
Try changing the instance type, or run it in a different AZ, and see if the problem persists.
Building a cluster with emr-6.2.0 on md5.xlarge, this is JAVA_HOME:
/usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64
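Putting these answers together, the --configurations argument from the question would look roughly like this on emr-6.x (a sketch only; verify the exact Corretto path on your AMI, e.g. with ls /usr/lib/jvm/ on a node):
--configurations '[{"Classification":"hadoop-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3","JAVA_HOME":"/usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64"}}]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"PYSPARK_PYTHON":"/usr/bin/python3","JAVA_HOME":"/usr/lib/jvm/java-1.8.0-amazon-corretto.x86_64"}}]}]'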

How to pull a private container from AWS ECR to a local cluster

I am currently having trouble pulling my remote Docker image hosted in AWS ECR. I am getting this error when running a deployment.
Step 1) Run:
aws ecr get-login-password --region cn-north-1 | docker login --username AWS --password-stdin xxxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn
Step 2) Run:
kubectl create -f backend.yaml
From here the following happens:
➜ backend git:(kubernetes-fresh) ✗ kubectl get pods
NAME READY STATUS RESTARTS AGE
backend-89d75f7df-qwqdq 0/1 Pending 0 2s
➜ backend git:(kubernetes-fresh) ✗ kubectl get pods
NAME READY STATUS RESTARTS AGE
backend-89d75f7df-qwqdq 0/1 ContainerCreating 0 4s
➜ backend git:(kubernetes-fresh) ✗ kubectl get pods
NAME READY STATUS RESTARTS AGE
backend-89d75f7df-qwqdq 0/1 ErrImagePull 0 6s
➜ backend git:(kubernetes-fresh) ✗ kubectl get pods
NAME READY STATUS RESTARTS AGE
backend-89d75f7df-qwqdq 0/1 ImagePullBackOff 0 7s
So then I run kubectl describe pod backend and it will output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 117s default-scheduler Successfully assigned default/backend-89d75f7df-qwqdq to minikube
Normal Pulling 32s (x4 over 114s) kubelet, minikube Pulling image "xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/baopals:latest"
Warning Failed 31s (x4 over 114s) kubelet, minikube Failed to pull image "xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/baopals:latest": rpc error: code = Unknown desc = Error response from daemon: Get https://xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/v2/baopals/manifests/latest: no basic auth credentials
Warning Failed 31s (x4 over 114s) kubelet, minikube Error: ErrImagePull
Warning Failed 19s (x6 over 113s) kubelet, minikube Error: ImagePullBackOff
Normal BackOff 4s (x7 over 113s) kubelet, minikube Back-off pulling image "xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/baopals:latest"
The main error being no basic auth credentials.
Now, what I am confused about is that I can push images to my ECR fine, and I can also push to my remote EKS cluster. Essentially, the only thing I can't do right now is pull from my private repository hosted on ECR.
Is there something obvious that I'm missing here that is preventing me from pulling from private repos so I can use them on my local machine?
To fetch an ECR image locally, you log in to ECR and pull the Docker image. On Kubernetes, however, you have to store the ECR login details in a secret and use that secret each time an image is pulled from ECR.
Here is a shell script for Kubernetes; it automatically takes values from your AWS configuration, or you can update the variables at the start of the script.
ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text) #aws account number
REGION=ap-south-1 #aws ECR region
SECRET_NAME=${REGION}-ecr-registry #secret_name
EMAIL=abc#xyz.com #can be anything
TOKEN=`aws ecr --region=$REGION get-authorization-token --output text --query authorizationData[].authorizationToken | base64 -d | cut -d: -f2`
kubectl delete secret --ignore-not-found $SECRET_NAME
kubectl create secret docker-registry $SECRET_NAME \
--docker-server=https://$ACCOUNT.dkr.ecr.${REGION}.amazonaws.com \
--docker-username=AWS \
--docker-password="${TOKEN}" \
--docker-email="${EMAIL}"
The secret is then referenced via imagePullSecrets in the YAML file for pulling from private Docker repos (see the sketch after the link below).
https://github.com/harsh4870/ECR-Token-automation/blob/master/aws-token.sh
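For reference, a minimal sketch of how the Deployment in backend.yaml might reference the secret created by that script. The Deployment and container names here are assumptions; the secret name follows the script's ${REGION}-ecr-registry pattern with the asker's region:
# Sketch only: everything except the image and the imagePullSecrets pattern is assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: xxxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/baopals:latest
      imagePullSecrets:
        - name: cn-north-1-ecr-registry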
When a node in your cluster launches a container, it needs credentials to access the private registry and pull the image. Even if you have authenticated on your local machine, the node cannot reuse that login, because by design it could be running on another machine; so you have to provide the credentials in the pod template. Follow this guide to do that:
https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
Basically you store the ECR credentials as a secret and provide it in the imagePullSecrets field of the pod spec. The pod will then be able to pull the image every time.
If you are developing with your cluster running on your local machine, you don't even need to do that. You can have the pod reuse the image you have already downloaded to your local cache by either setting imagePullPolicy in the container spec to IfNotPresent, or by using a specific tag instead of latest for your image.
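For purely local development on minikube, a hedged sketch of that approach (the deployment and container names are assumptions carried over from the question, and the dev tag is arbitrary):
# Sketch: build straight into minikube's Docker daemon so no registry pull is needed.
eval $(minikube docker-env)
docker build -t baopals:dev .
kubectl set image deployment/backend backend=baopals:dev
Because the tag is not latest, Kubernetes defaults imagePullPolicy to IfNotPresent, so the locally built image is used instead of contacting ECR.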

Amazon ECS agent on ubuntu not starting

I am currently trying to build a custom Ubuntu AMI for AWS Batch, following the document mentioned here:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html
However, when I try to start the ECS agent on that machine, it always gives me this error:
2018-07-04T23:34:01Z [INFO] Amazon ECS agent Version: 1.18.0, Commit: c0defea9
2018-07-04T23:34:01Z [INFO] Loading state! module="statemanager"
2018-07-04T23:34:01Z [INFO] Event stream ContainerChange start listening...
2018-07-04T23:34:01Z [INFO] Creating root ecs cgroup: /ecs
2018-07-04T23:34:01Z [INFO] Creating cgroup /ecs
2018-07-04T23:34:01Z [WARN] Disabling TaskCPUMemLimit because agent is unabled to setup '/ecs' cgroup: cgroup create: unable to create controller: mkdir /sys/fs/cgroup/systemd/ecs: read-only file system
2018-07-04T23:34:01Z [WARN] Error getting valid credentials (AKID ): NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2018-07-04T23:34:01Z [INFO] Registering Instance with ECS
2018-07-04T23:34:01Z [ERROR] Could not register: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2018-07-04T23:34:01Z [ERROR] Error registering: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
I made sure the instance has the ecsInstanceRole associated with it.
Can you guys let me know what I am missing?
I'm not certain how you are starting the ecs-agent. We ran into the error
Disabling TaskCPUMemLimit because agent is unabled to setup '/ecs cgroup: cgroup create: unable to create controller: /sys/fs/cgroup/systemd/ecs: read-only file system
We resolved this by adding the volume --volume=/sys/fs/cgroup:/sys/fs/cgroup:ro to the systemd unit file that we have launching ecs.
Outside of that, I assume the issue resides with the ecsInstanceRole. Can you verify it has the following permissions? AmazonEC2ContainerRegistryFullAccess, AmazonEC2ContainerServiceFullAccess, AmazonEC2ContainerServiceforEC2Role
Below is the full systemd file for ecs-agent.
[Unit]
Description=Docker Container %I
Requires=docker.service
After=docker.service
[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f %i
ExecStart=/usr/bin/docker run --name %i \
--restart=on-failure:10 \
--volume=/var/run:/var/run \
--volume=/var/log/ecs/:/log:Z \
--volume=/var/lib/ecs/data:/data:Z \
--volume=/etc/ecs:/etc/ecs \
--volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
--net=host \
--env-file=/etc/ecs/ecs.config \
--env LOGSPOUT=ignore \
amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop %i
[Install]
WantedBy=default.target
I ran into the same messages. You need to create the IAM role and launch the instance with that role, per this documentation: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html
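To confirm the role is actually attached and that its credentials are reachable from the instance, a quick check (a sketch assuming IMDSv1 is enabled; ecsInstanceRole is just an example role name):
# Sketch: the first call should print the attached role name; the second should return temporary credentials for it.
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/ecsInstanceRole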

error while running update environment in elastic beanstalk

Hi, could someone help with the following error on one of our Elastic Beanstalk applications please?
ERROR Docker container quit unexpectedly after launch: lipse.jetty.server.Server.doStart(Server.java:431) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at winstone.Launcher.<init>(Launcher.java:152) ... 7 more. Check snapshot logs for details.
[Instance: i--------] Command failed on instance. Return code: 1 Output: (TRUNCATED)...xpectedly after launch: lipse.jetty.server.Server.doStart(Server.java:431) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at winstone.Launcher.<init>(Launcher.java:152) ... 7 more. Check snapshot logs for details. Hook /opt/elasticbeanstalk/hooks/appdeploy/enact/00run.sh failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
This error occurs when the following update command is run:
aws elasticbeanstalk update-environment \
--application-name app_name \
--environment-name --------- \
--version-label ------- \
--template-name ------- \
--region eu-west-1
Thanks