How can you implement incremental / scheduled docker shutdown commands?

I'm spinning up a Docker container once a day on an EC2 instance to run some scheduled tasks. I've noticed that, over time, the unused Docker containers use up disk space, and occasionally the EC2 instance fails because it has run out of disk space.
In this case I generally run the command docker system df to see how much disk space docker is using on the EC2, and then run docker system prune to remove all those unused containers and free up disk space.
I'd like to automate the second command; however, I probably want to retain logs for a couple of days or weeks.
Is there a way to schedule this, with filtering, so I can say something like "remove unused Docker containers older than X days", or "keep only 30 unused containers and remove the oldest one once the limit of 30 is reached"?
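For what it's worth, Docker's prune commands accept an age filter, though as far as I know there is no built-in "keep only the newest N containers" option. A hedged sketch, assuming cron is available on the instance and that keeping roughly a week of stopped containers (168h) is acceptable; the schedule, retention window, and log path are arbitrary examples:

    # /etc/cron.d/docker-cleanup (hypothetical file) - run daily at 03:00 as root
    0 3 * * * root docker container prune --force --filter "until=168h" >> /var/log/docker-cleanup.log 2>&1
    # Or, to also reclaim dangling images and unused networks older than a week:
    # 0 3 * * * root docker system prune --force --filter "until=168h" >> /var/log/docker-cleanup.log 2>&1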

Related

AWS ec2 instance becomes unresponsive after I/O heavy operation in dockerfile

I'm using a free-tier EC2 instance (t2.micro) with the default EBS volume. I use a Docker container to host a website.
Occasionally, when I run out of storage, I have to bulk delete my dangling Docker images. After performing the delete and rebuilding the container, the SSH session hangs (while installing the npm modules), and I'm not able to even log into my machine for almost 1-2 hours.
From research, I realized this has something to do with burst credits, but on inspecting my EBS burst credits I still have 60 left, and I have around 90 CPU credits.
I'm not sure why this unresponsiveness is happening; my instance even stops serving the website it hosts for 1-2 hours after this.
For reference, this is my Dockerfile.

Problems with Memory and CPU limits in AWS ECS cluster running on reserved EC2 instance

I am running an ECS cluster that currently has 3 services on a T3 medium instance. Each of those services runs only one task, which has a soft memory limit of 1GB; the hard limit is different for each (but that should not be the problem). I will always have enough memory to run one newly deployed task (the new one will also take 1GB, and the T3 medium can handle it since it has 4GB total). After the new task is up and running, the old one is stopped and I again have 1GB free for the next deployment. I did something similar with the CPU (2048 CPU units, each task gets 512, and 512 are left free for new deployments).
So everything runs fine now, but I am not completely satisfied with this setup for the future. What will happen if I need to add another service with another task? I would need to redeploy all existing tasks and modify their task definitions to use less CPU and memory in order to fit the new task (and its deployments). I am planning to get a reserved EC2 instance, so it will not be easy to swap the current instance for a larger one.
Is there a way to spin up another EC2 instance for the same ECS cluster to handle bursts in my tasks? Deployments are also not ideal: to avoid downtime I can only deploy one task and then wait for the old one to be killed before deploying the next.
My biggest concern: if I need a new service and task, I again have to adjust all the others in order to run the new one and redeploy them, which is not very maintainable. And what if I cannot lower CPU and memory any further because I have already reached the minimum needed to run the tasks smoothly?
I was thinking about having another EC2 instance in the same cluster to handle bursts, deployments, and new services/tasks, but I am not sure if that's possible or if it's the best way of doing this. I was also thinking about Fargate, but that is much more expensive and I cannot afford it for now. What do you think? Any ideas, suggestions, and hints would be helpful, since I am desperate to find the best way to avoid the problems mentioned above.
Thanks in advance!
So unfortunately, there is no out-of-the-box solution to ensure that all your tasks run on the minimum possible number of instances (i.e. one). You can use our new feature called Capacity Providers (CP), which allows you to ensure the minimum number of EC2 instances required to run all your tasks. The major difference between a CP and a plain ASG is that a CP gives more weight to task placement (whereas an ASG scales in/out based on resource utilization, which isn't ideal in your case).
However, it's not an ideal solution. Just as you said in your comment, when the service needs to scale out during a deployment, the CP will spin up another instance, the new task will be placed on it, and once it reaches the Running state, the old task will be stopped.
But now you have an "extra" EC2 instance, because there is no way to replace a running task. The only workaround I can think of would be a Lambda function that drains the new instance, which moves all the service's tasks to the other instance. After about 15 minutes, the CP will terminate the drained instance since no tasks are running on it.
A couple of caveats:
- CPs are new and a little rough around the edges; you can't delete or modify them, only create or deactivate them.
- A CP needs an underlying ASG, and they must have a 1:1 relationship.
- Make sure to enable managed scaling when creating the CP.
- Choose a 100% capacity target.
- Don't forget to add a default capacity provider strategy for the cluster (see the CLI sketch below).
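As a rough illustration of those last few points (the CP name, cluster name, and ASG ARN are placeholders, and the exact shorthand is worth checking against the current AWS CLI docs), creating a CP with managed scaling at a 100% target and wiring it in as the cluster default might look like:

    # Hypothetical: create a capacity provider backed by an existing ASG,
    # with managed scaling enabled and a 100% target capacity.
    aws ecs create-capacity-provider \
      --name my-cp \
      --auto-scaling-group-provider "autoScalingGroupArn=<your-asg-arn>,managedScaling={status=ENABLED,targetCapacity=100}"

    # Hypothetical: make it the cluster's default capacity provider strategy.
    aws ecs put-cluster-capacity-providers \
      --cluster my-cluster \
      --capacity-providers my-cp \
      --default-capacity-provider-strategy capacityProvider=my-cp,weight=1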
Minimizing EC2 instances used:
If you're using a capacity provider, the 'binpack' placement strategy minimises the number of EC2 hosts that are used.
However, there are some scale-in scenarios where you can end up with a single task running on its own EC2 instance. As Ali mentions in their answer, ECS will not replace this running task, but depending on your setup, it may be fairly easy for you to replace it yourself by configuring your task to voluntarily 'quit'.
In my case, I always have at least 2 tasks running per service, so I just added some logic to my tasks' healthchecks so they report as unhealthy after ~6 hours. ECS will spot the 'unhealthy' task, remove it from the load balancer, and spin up a replacement (according to the binpack strategy).
Note: if you take this approach, add some variation to your timeout so you're less likely to have all of your tasks expire at the same time. Something like: expiry = now + timedelta(hours=random.uniform(5.5, 6.5))
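Coming back to the binpack strategy mentioned above: it is specified on the service. A minimal sketch (the cluster, service, and task-definition names are placeholders):

    # Hypothetical: pack tasks onto as few container instances as possible, by memory.
    aws ecs create-service \
      --cluster my-cluster \
      --service-name my-service \
      --task-definition my-task \
      --desired-count 2 \
      --placement-strategy "type=binpack,field=memory"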
Sharing memory 'headspace' with soft-limits:
If you set both soft and hard memory limits, ECS will place your tasks based on the soft limit. If your tasks' memory usage varies with load, it's fairly easy to get your EC2 instance to start swapping.
For example, say you have a task defined with a soft limit of 900MB and a hard limit of 1800MB, and you spin up a service with 4 running tasks. ECS places all 4 of them on a single t3.medium: 4 x 900MB is about 3.6GB of the instance's 4GB, yet each task thinks it can safely use up to 1800MB, when in fact there's very little free memory left on the host. When you hit your service with some traffic, each task tries to use more memory, and your t3.medium is incapacitated as it starts swapping memory to disk. ECS does not recover from this type of failure very well: it notices that the tasks are no longer available and attempts to provision replacements, but the capacity provider is very slow to replace the swapping t3.medium.
My suggestion:
- Configure your service to auto-scale based on memory usage (this will be a percentage of your soft limit), for example a target memory usage of 70%.
- Configure your tasks' healthchecks so that they report as unhealthy when they are nearing their soft limit. This way, your tasks still have some headroom for quick spikes of memory usage, while giving your load balancer a chance to drain and gracefully replace tasks that are getting greedy. This is fairly easy to do by reading the value in /sys/fs/cgroup/memory/memory.usage_in_bytes, as in the sketch below.
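A minimal healthcheck sketch along those lines, assuming cgroup v1 (matching the path above) and an environment variable you set yourself in the task definition (MEM_SOFT_LIMIT_BYTES is an assumed name):

    #!/bin/sh
    # Hypothetical container healthcheck: report unhealthy above ~90% of the soft limit.
    usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
    threshold=$((MEM_SOFT_LIMIT_BYTES * 90 / 100))
    [ "$usage" -lt "$threshold" ]   # exit 0 = healthy, non-zero = unhealthy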

Does every aws batch job spin up a new docker container

Every time I submit a Batch job, does a new Docker container get created, or will an old container be reused?
If a new Docker container is created every time, what happens to the container when the job is done?
In Amazon ECS, the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION variable sets how long to wait from when a task is stopped until its Docker container is removed (3 hours by default).
If all these containers only get cleaned up after three hours, wouldn't the ECS container instance fill up quickly if I submit a lot of jobs?
I'm getting the error CannotCreateContainerError: API error (500): devmapper when running a Batch job. Would it help if I cleaned up the Docker container files at the end of the job?
Every time I submit a Batch job, does a new Docker container get created, or will an old container be reused?
Yes. Each job run on Batch will be run as a new ECS Task, meaning a new container for each job.
If all these containers only get cleaned up after three hours, wouldn't the ECS container instance fill up quickly if I submit a lot of jobs?
This all depends on your job workloads, job lengths, disk usage, etc. With large quantities of short jobs that consume disk, it is entirely possible.
CannotCreateContainerError: API error (500): devmapper
The documentation for this error indicates a few possible solutions; however, the first one, which you've already called out, may not help in this case.
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, which defaults to 3h on ECS, seems to be set to 2m by default on Batch clusters - you can inspect the EC2 user data on one of your Batch instances to validate that it is set this way on your clusters. Depending on the age of the cluster, these settings may differ. Batch does not automatically update to the latest ECS-optimized AMI without the creation of a whole new cluster, so I would not be surprised if it does not change these settings either.
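One hedged way to check this on a running Batch container instance (over SSH or SSM; assumes IMDSv1 is reachable from the instance and that the user data writes the setting into the usual agent config file):

    # Hypothetical check: look at the instance's user data...
    curl -s http://169.254.169.254/latest/user-data
    # ...or at the ECS agent config file it typically ends up in:
    grep ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION /etc/ecs/ecs.config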
If your cleanup duration setting is already low, you might try creating a custom AMI that provisions a larger-than-normal Docker volume. By default, the ECS-optimized AMIs ship with an 8GB root volume and a 22GB volume for Docker.
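If you go the larger-volume route, one hedged way to express it is a launch template that your Batch compute environment references (the names and sizes are placeholders; /dev/xvdcz is the Docker storage device on the Amazon Linux 1 ECS-optimized AMI):

    # Hypothetical: launch template giving Docker a 100 GB volume instead of the default 22 GB.
    aws ec2 create-launch-template \
      --launch-template-name batch-big-docker-volume \
      --launch-template-data '{"BlockDeviceMappings":[{"DeviceName":"/dev/xvdcz","Ebs":{"VolumeSize":100,"VolumeType":"gp2"}}]}'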

AWS ECS running a task that requires many cores

I am conceptually trying to understand how to use AWS ECS to run my "cluster" jobs.
I have some scientific software inside a Docker container, that natively takes advantage of as many cores as the underlying instance has to offer.
My question in this case is: can I use AWS ECS to "increase" the number of "visible" cores to the task running inside my Docker container? For instance, is my "cluster" limited to only a single instance, or is a "cluster" expandable to multiple instances?
I haven't been able to find any answers by looking through the AWS docs.
A cluster is just a group of EC2 instances that are ECS-enabled (running the special agent software). Tasks that you run on this cluster are spread across these instances. Each task can involve multiple containers. However, each container stays within its instance's boundaries, hardware-wise: it is allocated a number of "CPU units" and shares them with the other containers running on the same instance.
From my understanding, running a process that spans multiple cores in a container doesn't quite fit the ECS architecture idea; it seems like trying to do part of the ECS scheduler's job.
I found these resources useful when I was reading about it:
My notes on Amazon's ECS post by Jérôme Petazzoni
Application Architecture in ECS docs
Task Definition Parameters in ECS docs
I had a similar situation moving a Python app that used a script to spawn copies of itself based on the number of cores. The answer to this isn't so much an ECS problem as it is a Docker best practice... you should strive to use 1 process per container. (see https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/)
I ended up implementing this by using a Dockerfile to run each process, and then relying on essential ECS tasks so that a task reloads itself if it dies.
Your cluster is a collection of EC2 instances with the ECS service running. Each instance has a certain number of CPU 'units' (typically 1024 units === 1 core) and RAM. I profiled my app at peak load and tweaked the mix until I got it where I liked it. If your app can use more CPU than that, try giving it 2048 CPU or some other amount and see how it performs. I used Meros (https://meros.io/) to profile my app.
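As a rough sketch of what "giving it 2048 CPU" looks like on the EC2 launch type (the family, container name, image, and memory value are placeholders):

    # Hypothetical: register a task definition revision that reserves 2048 CPU units (~2 cores).
    aws ecs register-task-definition \
      --family my-task \
      --container-definitions '[{"name":"app","image":"my-app:latest","cpu":2048,"memory":2048,"essential":true}]'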
Hope this helps!
"increase" the number of "visible" cores to the task running inside my Docker container
Containers and clusters are different things: you may run a lot of containers on one instance, but you can't run one container across multiple instances.
A cluster is a set of container instances (EC2 hosts running the ECS agent) that your Docker containers run on.
is my "cluster" limited to only a single instance?
No, you can choose the number of instances in the cluster.

How can I apply chef configuration without registering the node in the server?

I am programming some short-lived EC2 instances. They will exist for only a few hours to do a job every now and again but will require a very large EBS volume; to keep it around all the time would cost hundreds of dollars a month. Because EBS volumes are pro-rated, I can just allocate this volume when I need it and discard it after the job is complete so the cost will not be all that high (EBS volumes are billed hourly after all).
Unfortunately, Elastic File System is not yet available in my region, and it's also in preview at the moment, so it's probably not suitable for production use yet anyway.
Anyway, that's really just background. What I'd like to do is have my instance automatically configure itself when it is started, using user data. I would like it to download a script from an S3 repository that instructs it to install chef-client and execute a chef-client run that will set up the node. It will then run another command to kick off the job. Once that's complete, AWS Data Pipeline will automatically terminate the instance.
The one point I don't like about the above is that when I register the node, the node will be registered in my Chef server. I'd like to just download the configuration for a specified role without actually registering anything. I'll never need to run the configuration again because the instance will be gone in a couple of hours once the job is complete.
I could of course script the entire setup and execution of the above using shell scripts but I'd rather tie it in with all the Chef infrastructure we've already built, which is integrated with our CI server and is fully source-controlled and so on.
You could use chef-provisioning and/or knife-zero.
They start a chef-zero in-memory server locally, then bootstrap a node which connects through the SSH connection back to your local chef-zero server. After the converge, the connection is shut down. It's much like rsync+chef-solo but on steroids.
See:
https://github.com/higanworks/knife-zero
https://github.com/chef/chef-provisioning
https://github.com/chef/chef-provisioning-aws
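For illustration only, a knife-zero bootstrap of a freshly launched instance might look roughly like this (the host, SSH user, and role name are placeholders, and the exact options vary by knife/knife-zero version, so check the README above):

    # Hypothetical: converge the new EC2 instance from a workstation or CI runner
    # against a local in-memory chef-zero server, without registering it with the Chef server.
    knife zero bootstrap 10.0.0.42 \
      -x ec2-user --sudo \
      --node-name ephemeral-job-node \
      --run-list 'role[big_ebs_job]'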