AWS ECS tasks keep starting and stopping

I am trying to use ECS for deployment with Travis.
At one point everything was working, but now it has stopped.
I am following this tutorial https://testdriven.io/part-five-ec2-container-service/
There are 2 tasks that keep stopping and starting.
These are the messages I see in tasks:
STOPPED (CannotStartContainerError: API error (500): oci ru)
STOPPED (Essential container in task exited)
These are the messages I see in the logs:
FATAL: could not write to file "pg_wal/xlogtemp.28": No space left on device
container_linux.go:262: starting container process caused "exec: \"./entrypoint.sh\": permission denied"
Why is ECS stopping and starting so many new tasks? This was not happening before.
This is my docker_deploy.sh from my main microservice, which I am calling via Travis.
#!/bin/sh

if [ -z "$TRAVIS_PULL_REQUEST" ] || [ "$TRAVIS_PULL_REQUEST" == "false" ];
then
  if [ "$TRAVIS_BRANCH" == "staging" ];
  then
    JQ="jq --raw-output --exit-status"

    configure_aws_cli() {
      aws --version
      aws configure set default.region us-east-1
      aws configure set default.output json
      echo "AWS Configured!"
    }

    make_task_def() {
      task_template=$(cat ecs_taskdefinition.json)
      task_def=$(printf "$task_template" $AWS_ACCOUNT_ID $AWS_ACCOUNT_ID)
      echo "$task_def"
    }

    register_definition() {
      if revision=$(aws ecs register-task-definition --cli-input-json "$task_def" --family $family | $JQ '.taskDefinition.taskDefinitionArn');
      then
        echo "Revision: $revision"
      else
        echo "Failed to register task definition"
        return 1
      fi
    }

    deploy_cluster() {
      family="testdriven-staging"
      cluster="ezasdf-staging"
      service="ezasdf-staging"

      make_task_def
      register_definition

      if [[ $(aws ecs update-service --cluster $cluster --service $service --task-definition $revision | $JQ '.service.taskDefinition') != $revision ]];
      then
        echo "Error updating service."
        return 1
      fi
    }

    configure_aws_cli
    deploy_cluster
  fi
fi
This is my Dockerfile from my users microservice:
FROM python:3.6.2
# install environment dependencies
RUN apt-get update -yqq \
    && apt-get install -yqq --no-install-recommends \
       netcat \
    && apt-get -q clean
# set working directory
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
# add requirements (to leverage Docker cache)
ADD ./requirements.txt /usr/src/app/requirements.txt
# install requirements
RUN pip install -r requirements.txt
# add entrypoint.sh
ADD ./entrypoint.sh /usr/src/app/entrypoint.sh
RUN chmod +x /usr/src/app/entrypoint.sh
# add app
ADD . /usr/src/app
# run server
CMD ["./entrypoint.sh"]
entrypoint.sh:
#!/bin/sh
echo "Waiting for postgres..."
while ! nc -z users-db 5432; do
  sleep 0.1
done
echo "PostgreSQL started"
python manage.py recreate_db
python manage.py seed_db
gunicorn -b 0.0.0.0:5000 manage:app
I tried deleting my cluster, deregistering my tasks, and restarting, but ECS still continuously stops and starts new tasks.
When it was working fine, the difference was that instead of CMD ["./entrypoint.sh"] in my Dockerfile, I had:
RUN python manage.py recreate_db
RUN python manage.py seed_db
CMD gunicorn -b 0.0.0.0:5000 manage:app
Travis is passing.

The errors are right there.
You don't have enough space on your host, and execution of the entrypoint.sh file is being denied.
Ensure your host has enough disk space (shell in and run df -h to check, then expand the volume or bring up a new instance with more space). For entrypoint.sh, ensure that when you build your image the file is executable (chmod +x) and readable by the user the container runs as.
Test your containers locally first; the second error should have been caught in development instantly.
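For reference, a quick way to confirm both issues might look like this (a sketch; it assumes you can SSH into the container instance and that entrypoint.sh sits in your build context):
# On the ECS container instance:
df -h              # check free space on the volume Docker uses
docker system df   # see how much space images, containers, and volumes consume
# In the microservice's build context, before building and pushing the image:
chmod +x entrypoint.sh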

I realize this answer isn't 100% relevant to the question asked, but some googling brought me here due to the title and I figure my solution might help someone later down the line.
I also had this issue, but the reason my containers kept restarting wasn't a lack of space or other resources: I had enabled dynamic host port mapping and forgotten to update my security group accordingly. The health checks my load balancer sent to my containers inevitably failed, and ECS restarted the containers (whoops).
Dynamic Port Mapping in AWS Documentation:
https://aws.amazon.com/premiumsupport/knowledge-center/dynamic-port-mapping-ecs/
https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_PortMapping.html Contents --> hostPort
tl;dr - Make sure your load balancer can health check ports 32768 - 65535.
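For illustration, opening the ephemeral range to the load balancer's security group could look like this (a sketch; both security group IDs are hypothetical placeholders):
aws ec2 authorize-security-group-ingress \
  --group-id sg-ECS-INSTANCES \
  --protocol tcp \
  --port 32768-65535 \
  --source-group sg-LOAD-BALANCER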

If too many tasks are running and they have consumed the space, you will need to shell in to the host and do the following. Don't use -f on the docker rm, as that would remove the running ECS agent container:
docker rm $(docker ps -aq)

Run docker ps -a to list all stopped ("Exited") containers; these also consume disk space. Use the command below to remove those zombies:
docker rm $(docker ps -a | grep Exited | awk '{print $1}')
Also remove old or unused images; these take up even more disk space than containers:
docker rmi -f image_name
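A more compact alternative (my addition; it assumes a reasonably recent Docker release) is to use the built-in prune commands:
docker container prune   # removes all stopped containers
docker image prune -a    # removes images not referenced by any container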

Related

Where do I put `.aws/credentials` for Docker awslogs log-driver (and avoid NoCredentialProviders)?

The Docker awslogs documentation states:
the default AWS shared credentials file (~/.aws/credentials of the root user)
Yet if I copy my AWS credentials file there:
sudo bash -c 'mkdir -p $HOME/.aws; cp .aws/credentials $HOME/.aws/credentials'
... and then try to use the driver:
docker run --log-driver=awslogs --log-opt awslogs-group=neiltest-deleteme --rm hello-world
The result is still the dreaded error:
docker: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors.
Where does this file really need to go? Is it because the Docker daemon isn't running as root but rather some other user and, if so, how do I determine that user?
NOTE: I can work around this on systems using systemd by setting environment variables. But this doesn't work on Google CloudShell where the Docker daemon has been started by some other method.
Ah ha! I figured it out and tested this on Debian Linux (on my Chromebook w/ Linux VM and Google CloudShell):
The .aws folder must be in the filesystem root (/), not in the root user's $HOME folder!
Based on that I was able to successfully run the following:
pushd $HOME; sudo bash -c 'mkdir -p /.aws; cp .aws/* /.aws/'; popd
docker run --log-driver=awslogs --log-opt awslogs-region=us-east-1 --log-opt awslogs-group=neiltest-deleteme --rm hello-world
I initially figured this all out by looking at the Docker daemon's process information:
DOCKERD_PID=$(ps -A | grep dockerd | grep -Eo '[0-9]+' | head -n 1)
sudo cat /proc/$DOCKERD_PID/environ
The confusing bit is that Docker's documentation here is wrong:
the default AWS shared credentials file (~/.aws/credentials of the root user)
The true location is /.aws/credentials. I believe this is because the daemon starts before $HOME is actually defined since it's not running as a user process. So starting a shell as root will tell you a different story for tilde or $HOME:
sudo sh -c 'cd ~/; echo $PWD'
That outputs /root but using /root/.aws/credentials does not work!
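For completeness, the systemd workaround mentioned in the question's note usually takes the form of a drop-in unit that hands the credentials to dockerd as environment variables (a sketch; the key values are placeholders):
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/aws-credentials.conf <<'EOF'
[Service]
Environment="AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEYID"
Environment="AWS_SECRET_ACCESS_KEY=EXAMPLESECRETACCESSKEY"
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker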

How to stop container if it exist otherwise do nothing in gitlab pipeline

I am trying to run a Docker image inside an EC2 instance using GitLab CI/CD, exposing port 5000 for the application.
The job works the first time, but subsequent runs fail because Docker does not allow running the image on a port that is already in use. So I am trying to implement a fail-safe mechanism: before running, check for the existing container; if it exists, stop and remove it, then run the image on port 5000 again.
The problem I am facing is that on the very first run there is nothing to stop, so docker stop complains that it needs at least one argument.
Is there a way to run this command conditionally, i.e. only if the container exists, and otherwise do nothing?
deploy:
  stage: deploy
  before_script:
    - chmod 400 $SSH_KEY
  script: ssh -o StrictHostKeyChecking=no -i $SSH_KEY ec2-user@ecxxxxx-xxxx.ap-southeast-1.compute.amazonaws.com "
    docker login -u $REGISTRY_USER -p $REGISTRY_PASS &&
    docker ps -aq | xargs docker stop | xargs docker rm &&
    docker run -d -p 5000:5000 $IMAGE_NAME:$IMAGE_TAG"
Error on the pipeline:
"docker stop" requires at least 1 argument.
See 'docker stop --help'.
Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...]
Stop one or more running containers
"docker rm" requires at least 1 argument.
See 'docker rm --help'.
Usage: docker rm [OPTIONS] CONTAINER [CONTAINER...]
Remove one or more containers
The problem is with the xargs docker stop | xargs docker rm command. Is there a way to solve this kind of problem?
Edit: This doesn't exactly answer my question. What if a junior engineer who doesn't know the name of the image is assigned the task of setting up this pipeline? This solution requires us to know the name of the image, so in that case it won't work.
What I understood is that you are not stopping the image; you are stopping the container, removing it, and then creating a new container with port 5000 exposed.
So give the container a constant name that is the same every time it is created. The || true ensures the stop/remove step is skipped gracefully when the container doesn't exist yet:
variables:
  CONTAINER_NAME: <your-container-name> # give the container created from this image a fixed name

deploy:
  stage: deploy
  before_script:
    - chmod 400 $SSH_KEY
  script: ssh -o StrictHostKeyChecking=no -i $SSH_KEY ec2-user@ecxxxxx-xxxx.ap-southeast-1.compute.amazonaws.com "
    docker login -u $REGISTRY_USER -p $REGISTRY_PASS &&
    docker stop $CONTAINER_NAME; docker rm $CONTAINER_NAME || true &&
    docker run -d -p 5000:5000 --name $CONTAINER_NAME $IMAGE_NAME:$IMAGE_TAG"
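An even shorter variant (a sketch, not part of the original answer) is docker rm -f, which stops and removes a running container in one step and can simply be ignored when the container does not exist yet; the remote command portion would then read:
docker rm -f $CONTAINER_NAME 2>/dev/null || true &&
docker run -d -p 5000:5000 --name $CONTAINER_NAME $IMAGE_NAME:$IMAGE_TAG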

Dockerfile Entrypoint - still exiting with exec $#?

I'm working on a container to use megacmd (CLI syncing utility from Mega.nz, storage provider).
Relatively new to Dockerfiles, I've successfully made a Dockerfile that installs MegaCMD and logs in, but once it does that, the container stops.
In my compose file I have set tty: true, thinking that would keep it alive, but it does not.
FROM ubuntu:groovy
ENV email=email@example.com
ENV password=notyourpassword
RUN apt-get update \
....more stuff here
COPY megalogin.sh /usr/bin/local/megalogin.sh
ENTRYPOINT ["sh", "/usr/bin/local/megalogin.sh"]
#### Works up to here, but the container still stops when the login script finishes
megalogin.sh
#!/bin/sh
mega-login ${email} ${password}
mega-whoami
What do I need to do to make this thing stay running?
I have tried exec "$@" at the end of the script, but that didn't make any difference.
When you run your container, append tail -f /dev/null to the docker run command, e.g.
docker run -d [image-name] tail -f /dev/null
You should then be able to exec into the running container using docker exec -it [container-name] /bin/bash
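One likely reason exec "$@" made no difference here (an inference from the Dockerfile above, which defines no CMD): with an empty "$@" the script has nothing to exec and simply exits, so the container stops. A hypothetical variant of megalogin.sh that combines the login with a placeholder foreground CMD:
#!/bin/sh
# hypothetical variant of megalogin.sh: do the one-time login, then hand off
# to whatever the Dockerfile's CMD provides so PID 1 keeps running
mega-login ${email} ${password}
mega-whoami
exec "$@"
# ...and in the Dockerfile, after the ENTRYPOINT line, add for example:
# CMD ["tail", "-f", "/dev/null"]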
Not the exact best solution, but in the compose file I put:
tty: true
stdin_open: true
and it worked.

AWS Cloudwatch Agent in a docker container

I am trying to set up the Amazon CloudWatch Agent as a Docker container. This is an on-premise installation, so it's running locally, not inside AWS Kubernetes or anything of the sort.
I've set up a basic Dockerfile, an agent.json, and a .aws/ folder for credentials, and I use docker-compose build to set it up and then launch it. But I keep running into problems because the image does not contain or run systemctl, so I cannot start the service using AWS's own documented command:
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m onPremise -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s
This will fail on an error when I try to run the container:
cloudwatch_1 | /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl: line 262: systemctl: command not found
cloudwatch_1 | unknown init system
I've tried running /start-amazon-cloudwatch-agent inside /bin as well, but no luck, and there is no documentation on this.
Basically the issue is: how can I run this as a service or as a foreground process? Anyone have any clues? Otherwise the container won't stay up. Below is my code:
dockerfile
FROM amazonlinux:2.0.20190508
RUN yum -y install https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
COPY agent.json /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
CMD /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m onPremise -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
agent.json
{
  "agent": {
    "metrics_collection_interval": 60,
    "region": "eu-west-1",
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log",
    "debug": true
  }
}
.aws/ folder contains config and credentials, but I never got as far for the agent to actually try and make a connection.
Just use the official image (docker pull amazon/cloudwatch-agent); it will handle all of this for you.
If you insist on using your own, try the following:
FROM amazonlinux:2.0.20190508
RUN yum -y install https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
COPY agent.json /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json
ENV RUN_IN_CONTAINER=True
ENTRYPOINT ["/opt/aws/amazon-cloudwatch-agent/bin/start-amazon-cloudwatch-agent"]
Use the official AWS Docker image; here is an example docker-compose file:
version: "3.8"
services:
agent:
image: amazon/cloudwatch-agent:1.247350.0b251814
volumes:
- ./config/log-collect.json:/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json # agent config
- ./aws:/root/.aws # required for authentication
- ./log:/log # sample log
- ./etc:/opt/aws/amazon-cloudwatch-agent/etc # for debugging the config of AWS of container
From the config above, only the first two volume mounts are required; the third and fourth are for debugging purposes.
If you are interested in learning what each volume does, you can read more at https://medium.com/@gusdecool/setup-aws-cloudwatch-agent-on-premise-server-part-1-31700e81ab8

Django migrations with Docker on AWS Elastic Beanstalk

I have a Django app running inside a single Docker container on AWS Elastic Beanstalk. I cannot get it to run migrations properly; it always sees the old Docker image and tries to run migrations from that (but it doesn't have the latest files).
I package an .ebextensions directory with my Elastic Beanstalk source bundle (a zip containing a Dockerrun.aws.json file and the .ebextensions dir). It has a setup.config file that looks like this:
container_commands:
  01_migrate:
    command: "CONTAINER=`docker ps -a --no-trunc | grep aws_beanstalk | cut -d' ' -f1 | head -1` && docker exec $CONTAINER python3 manage.py migrate"
    leader_only: true
This is partially modeled after the comments on this SO question.
I have verified that it can work if I simply re-deploy the app a second time, since this time the previous running image will have the updated migrations file.
Does anyone know how to access the latest docker image or latest running container in an .ebextensions script?
Based on AWS Documentation on Customizing Software on Linux Servers, container_commands will be executed before your app is deployed.
You can use the container_commands key to execute commands for your container. The commands in container_commands are processed in alphabetical order by name. They run after the application and web server have been set up and the application version file has been extracted, but before the application version is deployed. They also have access to environment variables such as your AWS security credentials. Additionally, you can use leader_only. One instance is chosen to be the leader in an Auto Scaling group. If the leader_only value is set to true, the command runs only on the instance that is marked as the leader.
Also take a look at my answer here; it runs commands at different app deployment states and shows their output.
So the solution to your problem might be to create a post app deployment hook.
.ebextensions/00_post_migrate.config
files:
  "/opt/elasticbeanstalk/hooks/appdeploy/post/10_post_migrate.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      if [ -f /tmp/leader_only ]
      then
        rm /tmp/leader_only
        docker exec `docker ps --no-trunc -q | head -n 1` python3 manage.py migrate
      fi

container_commands:
  01_migrate:
    command: "touch /tmp/leader_only"
    leader_only: true
I am using another approach. What I do is run a container based on the newly built image, pass in the environment variables from Elastic Beanstalk, and run the custom command in that container. When that command is done, the container is removed and the deployment proceeds.
So this is the script I have put inside .ebextensions/scripts/container_command.sh (make sure you replace everything within <>):
#!/bin/bash
COMMAND=$1

EB_CONFIG_DOCKER_IMAGE_STAGING=$(/opt/elasticbeanstalk/bin/get-config container -k <environment_name>_image)
EB_SUPPORT_FILES=$(/opt/elasticbeanstalk/bin/get-config container -k support_files_dir)

# build --env arguments for docker from env var settings
EB_CONFIG_DOCKER_ENV_ARGS=()
while read -r ENV_VAR; do
  EB_CONFIG_DOCKER_ENV_ARGS+=(--env "${ENV_VAR}")
done < <($EB_SUPPORT_FILES/generate_env)

docker run --name=shopblender_pre_deploy -d \
  "${EB_CONFIG_DOCKER_ENV_ARGS[@]}" \
  "${EB_CONFIG_DOCKER_IMAGE_STAGING}"

docker exec shopblender_pre_deploy ${COMMAND}

# clean up
docker stop shopblender_pre_deploy
docker rm shopblender_pre_deploy
Now, you can use this script to execute any custom command to the container that will be deployed later.
Something like this .ebextensions/container_commands.config:
container_commands:
  01-command:
    command: bash .ebextensions/scripts/container_command.sh "php app/console doctrine:schema:update --force --no-interaction" &>> /var/log/database.log
    leader_only: true
  02-command:
    command: bash .ebextensions/scripts/container_command.sh "php app/console fos:elastica:reset --no-interaction" &>> /var/log/database.log
    leader_only: true
  03-command:
    command: bash .ebextensions/scripts/container_command.sh "php app/console doctrine:fixtures:load --no-interaction" &>> /var/log/database.log
    leader_only: true
This way you also do not need to worry about what your latest started container is, which is a problem with the solution described above.