AWS ECS task healthcheck always failed

I made a task definition in AWS ECS as shown in the screenshots:
When I run the task definition in a cluster, it runs successfully, but the container's status remains unhealthy forever. When I run the same health-check command (the curl command) inside the container, it succeeds. I have also tried CMD instead of CMD-SHELL, but nothing works. Inside the container, Apache is running on port 80.
Note: I am building the Docker image by committing a Docker container, not from a Dockerfile.
I don't understand why the health check is not working, and I haven't found anything significant online. Please help if you have faced this issue before.

You have made a mistake in the command in the HEALTHCHECK section: exit 1 must follow a double pipe. I suppose you want CMD-SHELL, curl --fail http://localhost/ || exit 1 instead of ... | exit 1 |
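For reference, a minimal sketch of how that health check could look in the container definition of the task definition (the interval, timeout, and retry values here are assumptions, not taken from the screenshots):
"healthCheck": {
    "command": ["CMD-SHELL", "curl --fail http://localhost/ || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 10
}
Note that this also requires curl to be installed inside the image; if it is missing, the check fails even though Apache is healthy.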

Remove the double quotes around the command; it should work.

Related

Allow a bash script to run at boot in an AWS CentOS 7 instance

I need to create AWS CentOS 7 instance images for a customer, and need each instance to automatically send its IP and instance ID to our AWS server every time it boots. For example, this is the very basic test version of the script I need to run:
#!/bin/bash
serverIP=""
curl "https://$serverIP/myphp.php?id=sentid&ip=sentip"
If the script is run directly, it works fine and is received and processed by the server. But I can't get it to run at boot. I can't put the script directly in the "User Data" for security reasons, since the customer could easily see it there; it needs to be a script in the filesystem of the image.
I've tried several things that work fine on a physical Linux server, but not on AWS. I know profile.d runs every time someone logs in, but over-sending like that is fine.
/etc/profile.d/myscript.sh
This stops the AWS instance from booting. Even just
#!/bin/bash
echo "hello world"
prevents it from booting. The instance starts, but when you go to ssh into it you get 'Network Error: connection timed out', which is the standard error if you put a wrong ip in, or upset it by leaving a service like httpd enabled.
However, a blank bash script with just #!/bin/bash will allow the instance to start. Removing the script via User Data usually makes it boot; sometimes it just dies.
The first thing I tried was crontab. I did:
crontab -e
@reboot /var/ook/myscript.sh
systemctl enable crond.service
But the instance wouldn't start. So I put "systemctl disable crond.service" in the User Data and one instance booted, but another still stayed dead. myscript.sh was just another echo "doob" >> file, which worked fine when run directly.
I tried putting in /etc/systemd/system/my-startup.service:
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/var/ook/writedood.sh
[Install]
WantedBy=multi-user.target
then:
systemctl enable my-startup.service
But this did nothing. My script "writedood.sh" was just echo "doob" >> ./file.txt, with file.txt chmod 777. At least it didn't prevent the instance from starting.
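(One thing worth noting about this attempt: systemd runs services with / as the working directory, so the relative path ./file.txt does not point where it does when the script is run by hand. A sketch of the same unit with that fixed, assuming the paths above:
[Unit]
Description=Run writedood.sh once at boot

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/var/ook/writedood.sh

[Install]
WantedBy=multi-user.target
with writedood.sh writing to an absolute path such as /tmp/file.txt instead of ./file.txt.)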
To give context, an instance won't start if httpd is left enabled on shutdown, but will if you disable it in User Data.
I wanted to have a go at putting something in init.d but I'm not sure how to simply tell it to run a script once in the background, and given the plethora of success I've had so far with the instance not restarting, I'm not holding out much hope that that would work.
Thanks in advance!
Crontab worked once I placed the script and the output file in different folders. It's very finicky; any error, such as not being able to write to the output file, and the instance won't start. I put startscript.sh in /usr/local/src and output.out in /tmp/ to ensure there were no permission problems, and now the instance starts and runs the script on boot. A minimal sketch of the working setup is shown below.
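startscript.sh (contents assumed from the description above):
#!/bin/bash
echo "doob" >> /tmp/output.out
And the crontab entry (via crontab -e):
@reboot /usr/local/src/startscript.sh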
I then realised that the AWS EC2 Instances Console was sometimes causing the problem where I couldn't ssh in after stopping and starting. It blanks the public IPv4 address when I click stop, but when I start, it shows the old address and hangs. If I refresh the page, or uncheck and re-check the instance, the IP changes to the new address. This caused much consternation.

How to keep a container running in ECS

We are deploying an application in ECS which is exiting due to an error.
We need to log in to the container and check the logs, but the container stops when the application exits after the error.
How can I keep the container running so that I can ssh into it?
I tried using tail -f /dev/null in the startup.sh script, which runs at container startup.
I need the startup.sh script to run, because it configures SSH, etc.
However, executing tail -f /dev/null at the end of the script does not seem to keep the container running.
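For reference, the usual shape of such a script is to keep the blocker in the foreground as the very last command. A minimal sketch (the sshd line is an assumption about what the SSH setup looks like):
#!/bin/bash
# ... configure and start SSH ...
/usr/sbin/sshd
# exec replaces the shell, so tail becomes the main process and never exits
exec tail -f /dev/null
If startup.sh is not the container's entrypoint/command, or if tail is backgrounded with &, the container will still exit when its original main process does.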
Appreciate any advice on how to keep an ECS container running.

Docker executable not found in PATH when using AWS batch/ECS

I am trying to run a simple Dockerized Python script with AWS batch.
Is there a problem with my Docker image?
I have built the Docker image locally, and it runs fine locally. I pushed the image to an AWS repository (ECR), and pulling this remote image to my local machine also runs correctly.
Problem
I have set up my compute environment, job queue, and job definition, but I get this error
CannotStartContainerError: Error response from daemon:
OCI runtime create failed: container_linux.go:370:
starting container process caused:
exec: "docker": executable file not found in $PATH: unknown
when I run
["docker","run","-t","111111111111.dkr.ecr.us-region-X.amazonaws.com/myimage:latest","python3","hello_world.py","--MSG","ok"]
Is Docker installed?
I am using the ECS_AL2 image type. When I start an EC2 instance with this AMI and ssh into it, I can see that Docker is already installed; docker run works fine, for instance.
Is there a (generic) problem with my compute environment, job queue, or job definition?
When I instead try to run the command echo hello, this works fine.
Appreciate any advice/help you can provide.
UPDATE - ANSWER
#samtoddler helped me realize that I only needed
["python3","hello_world.py","--MSG","ok"]
in the Command statement
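In other words, the Command field is handed to the image's entrypoint just like the arguments after the image name in docker run. Locally, the equivalent of the fixed job definition is roughly:
docker run -t 111111111111.dkr.ecr.us-region-X.amazonaws.com/myimage:latest python3 hello_world.py --MSG ok
My original command effectively asked the container to run docker run inside itself, and the docker executable is not installed inside the image.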
This error:
CannotStartContainerError: Error response from daemon:
means it is coming from the Docker daemon, so Docker itself is doing its job.
It seems there is some trouble with your Docker image: how it is packaged and how you are trying to pass all those arguments.
Please check the Docker Image CMD section of the documentation on how to use ENTRYPOINT and CMD.
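For illustration, a hypothetical Dockerfile for an image like this could pin the interpreter in ENTRYPOINT and leave the overridable arguments in CMD (file names assumed from the question):
FROM python:3
WORKDIR /app
COPY hello_world.py .
ENTRYPOINT ["python3"]
CMD ["hello_world.py", "--MSG", "ok"]
With that packaging, the Batch Command field would only need ["hello_world.py","--MSG","ok"], or could be left empty to fall back to the CMD default.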
There is some explanation in this question docker-oci-runtime-create-failed-container-linux-go349-starting-container-pro

Cloud Composer GKE Node upgrade results in Airflow task randomly failing

The problem:
I have a managed Cloud composer environment, under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been acting erratically: tasks that were working properly are failing for no given reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
(screenshot: Airflow tree view)
Hovering over a failing task only shows an Operator: null message (null_operator). Also, there are no logs at all for that task.
I was able to reproduce the situation with another Composer environment, to confirm that the upgrade is the cause of the malfunction.
What I have tried so far:
I assumed the upgrade might have broken either the scheduler or Celery (Cloud Composer defaults to the CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried to restart Airflow completely, the same way I did with airflow-scheduler.
None of these fixed the issue.
Side note: I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). Connecting to localhost:5555 stays in a 'waiting' state forever. I don't know if this is related.
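(For context, the connection to localhost:5555 is assumed to go through a port-forward to a worker pod, something along the lines of:
kubectl port-forward airflow-worker-799dc94759-7vck4 5555:5555
with the pod name taken from the worker above.)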
Let me know if I'm missing something!
1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes

Running Docker container randomly disappears on AWS EC2 Ubuntu

I'm running Docker on a t2.micro AWS EC2 instance with Ubuntu.
I'm running several containers. One of my long-running containers (always the same one) has now disappeared for the third time, after running for about 2-5 days. It is just gone, with no sign of a crash.
The machine has not been restarted (uptime says 15 days).
I do not use the --rm flag: docker run -d --name mycontainer myimage.
There is no exited zombie of this container when running docker ps -a.
There is no log, i.e. docker logs mycontainer does not find any container.
There is no log entry in journalctl -u docker.service within the time frame where the container disappeared. However, there are some other log entries regarding another container (let's call it othercontainer) which occur repeatedly, about every 6 minutes (it's a cronjob; I don't know if that's relevant):
could not remove cluster networks: This node is not a swarm manager. Use
"docker swarm init" or "docker swarm join" to connect this node to swarm
and try again
Handler for GET /v1.24/networks/othercontainer_default returned error:
network othercontainer_default not found
Firewalld running: false
Even if there were, say, an out-of-memory issue, or if my application just exited, I would still have an exited Docker container zombie in the docker ps -a overview, probably with exit status 0 or != 0, right?
I also don't want to use --restart to restart it automatically; I just want to see the exited container.
Where can I look for more details to trace the issue?
Versions:
OS: Ubuntu 16.04.2 LTS (Kernel: 4.4.0-1013-aws)
Docker: Docker version 17.03.1-ce, build c6d412e
Thanks to a hint to look at dmesg (or the general journalctl), I think I finally found the issue.
Somehow, one of the cronjobs had been running docker system prune -f at its end, every 5 minutes. This command removes everything that is unused and not running: stopped containers, dangling images, and unused networks.
I didn't know about this command before, but this has to be how my exited containers got removed without my knowing how it happened.
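A quick way to locate such a job is to search the usual cron locations (a sketch; standard paths on Ubuntu):
grep -r "docker system prune" /etc/cron* /var/spool/cron 2>/dev/null
crontab -l | grep "docker system prune"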