AWS ECS instances running out of space

Since this morning I've been having trouble updating services in AWS ECS. The tasks fail to start, and the failed tasks show this error:
open /var/lib/docker/devicemapper/metadata/.tmp928855886: no space left on device
I have checked the disk space and there is space available:
/dev/nvme0n1p1 7,8G 5,6G 2,2G 73% /
Then I checked the inode usage and found that 100% of the inodes are used:
/dev/nvme0n1p1 524288 524288 0 100% /
Narrowing the search, I found that Docker volumes are the ones consuming the inodes.
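For reference, this is roughly how I tracked it down (the paths are the usual Docker defaults):
df -h /                                   # disk space: plenty free
df -i /                                   # inode usage: 100% used
# count inodes per directory to see where they are going; with the
# devicemapper storage driver, Docker data lives under /var/lib/docker
sudo du --inodes -x --max-depth=1 /var/lib/docker | sort -n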
I'm using the standard CentOS AMI.
Does this mean that there is a maximum number of services that can run on an ECS cluster? (At the moment I'm running 18 services.)
Can this be solved? Right now I can't do any updates.
Thanks in advance

You need to tweak the following environment variables on your EC2 hosts:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
ECS_IMAGE_CLEANUP_INTERVAL
ECS_IMAGE_MINIMUM_CLEANUP_AGE
ECS_NUM_IMAGES_DELETE_PER_CYCLE
You can find the full docs on all these settings here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-config.html
The default behavior is to check every 30 minutes, and only delete 5 images that are more than 1 hour old and unused. You can make this behavior more aggressive if you want to clean up more images more frequently.
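For example, a more aggressive cleanup could be configured in /etc/ecs/ecs.config on each container instance; the values below are only an illustration, so tune them to your own workload:
# /etc/ecs/ecs.config -- example values only
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=15m
ECS_IMAGE_CLEANUP_INTERVAL=10m
ECS_IMAGE_MINIMUM_CLEANUP_AGE=15m
ECS_NUM_IMAGES_DELETE_PER_CYCLE=20
The agent picks these up after a restart (for example, sudo systemctl restart ecs on the ECS-optimized Amazon Linux 2 AMI).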
Another way to save space is, rather than squashing your image layers together, to make use of a common shared base image layer across your different images and image versions. This can make a huge difference: if you have 10 different images that are each 1 GB in size, they take up 10 GB of space. But if you have a single 1 GB base image layer plus 10 small application layers of only a few MB each, the total is only a little more than 1 GB of disk space.
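As a rough sketch of the shared-base-layer approach (all image names here are made up, and it assumes an app1/ directory with your code next to the Dockerfiles), each application image starts FROM the same heavy base image, so the big layer is stored only once on the host:
# build the shared base image once (heavy dependencies live here)
cat > Dockerfile.base <<'EOF'
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends \
      python3 python3-pip && rm -rf /var/lib/apt/lists/*
EOF
docker build -t example/base:1.0 -f Dockerfile.base .

# each application image adds only a thin layer on top of the shared base
cat > Dockerfile.app1 <<'EOF'
FROM example/base:1.0
COPY app1/ /srv/app1/
CMD ["python3", "/srv/app1/main.py"]
EOF
docker build -t example/app1:1.0 -f Dockerfile.app1 .

# "docker system df -v" shows that the shared layers are only stored once
docker system df -v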

Related

Why not get 2 nano instances instead of 1 micro instance on AWS?

I'm choosing instances to run microservices on an AWS EKS cluster.
Reading about it in this article and taking a look at the AWS docs, it seems that choosing many small instances instead of one larger instance results in a better deal.
There seems to be no downside to taking, for example, 2 t3.nano instances (2 vCPU / 0.5 GiB each) vs 1 t3.micro (2 vCPU / 1 GiB). The price and the total memory are the same, but the amount of CPU provided differs hugely the more instances you get.
I assume there are some processes running on each machine by default, but I found nothing mentioning their impact on the machine's resources or usage. Is it negligible? Is there any advantage to taking one big instance instead?
The issue is whether or not your computing task can be completed on the smaller instances; there is also overhead involved in instance-to-instance communication that isn't present in intra-instance communication.
So, it is all about fitting your solution onto the instances and your requirements.
There is no right answer to this question. The answer depends on your specific workload, and you have to try out both approaches to find out what works best for your case. There are advantages and disadvantages to both approaches.
For example, if the OS takes 200 MB on each instance, you will be left with only 600 MB across both nano instances combined vs 800 MB on the single micro instance.
When the cluster scales out, initializing 2 nano instances might roughly take twice as much time as initializing one micro instance to provide the same additional capacity to handle the extra load.
Also, as noted by Cargo23, inter-instance communication might increase the latency of your application.

Task takes too much time pending on ECS

I've been dealing with a weird problem for some days.
I'm implementing the ECS logic to drain instances on termination (specifically on Spot interruption notice) using the ECS_ENABLE_SPOT_INSTANCE_DRAINING=true env var on the ecs-agent.
The process works fine: when an interruption notice arrives, ECS drains the instance and moves the containers to another one. But here is the problem: if the instance has never started that image before, the task takes too long to start (about 3 minutes, while the Spot interruption notice only gives 2 minutes), causing availability issues. If the image has run on that instance before, it only takes 20 seconds to spin up the task!
Have you experienced this problem before using ECS?
PS: The images are about 500 MB. Is that large for an image?
There are some strategies available to you:
Reduce the size of the image by optimising the Dockerfile. A smaller image is quicker to pull from the repository.
Bake the large image into the AMI used in the cluster (see the sketch after this list). Now every new Spot machine will already have the image. Depending on how the Dockerfile is created, a significant number of layers could be reused, resulting in quicker image pulls.
Once the image has been pulled to the machine it is cached, and subsequent pulls will be almost instantaneous.
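As a sketch of the AMI option, you can pre-pull the image while building the cluster AMI (for example from a Packer provisioner) or from the instance user data, so a fresh Spot instance already has the layers cached. The account ID, region, and image name below are placeholders:
# log in to ECR and pre-pull the service image so it is already cached
# when the ECS agent schedules the task (replace account/region/repo)
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-service:latest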

Neo4j performance discrepancies local vs cloud

I am encountering drastic performance differences between a local Neo4j instance running on a VirtualBox-hosted VM and a basically identical Neo4j instance hosted in Google Cloud (GCP). The task involves performing a simple load from a Postgres instance also located in GCP. The entire load takes 1-2 minutes on the VirtualBox-hosted VM instance and 1-2 hours on the GCP VM instance. The local hardware setup is a 10-year-old 8 core, 16GB desktop running VirtualBox 6.1.
With both VirtualBox and GCP I perform these similar tasks:
provision a 4 core, 8GB Ubuntu 18 LTS instance
install Neo4j Community Edition 4.0.2
use wget to download the latest apoc and postgres jdbc jars into the plugins dir
(only in GCP is the neo4j.conf file changed from defaults. I uncomment the "dbms.default_listen_address=0.0.0.0" line to permit non-localhost connections. Corresponding GCP firewall rule also created)
restart neo4j service
install and start htop and iotop for hardware monitoring
log in to the empty Neo4j instance via the browser console
load jdbc driver and run load statement
The load statement uses apoc.periodic.iterate to call apoc.load.jdbc. I've varied the "batchSize" parameter in both environments from 100-10000 but only saw marginal changes in either system. The "parallel" parameter is set to false because true causes lock errors.
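The statement has roughly this shape (the table, label, property names, and JDBC URL below are placeholders, not my real schema):
# run from cypher-shell; all names and the connection string are placeholders
cypher-shell -u neo4j -p '<password>' <<'EOF'
CALL apoc.periodic.iterate(
  "CALL apoc.load.jdbc('jdbc:postgresql://<pg-host>/<db>?user=<user>&password=<pw>', 'my_table') YIELD row RETURN row",
  "MERGE (n:MyLabel {id: row.id}) SET n.name = row.name",
  {batchSize: 1000, parallel: false});
EOF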
Watching network I/O, both take the first ~15-25 seconds to pull the ~700k rows (8 columns) from the database table. Watching CPU, both keep one core maxed at 100% while another core varies from 0-100%. Watching memory, neither takes more than 4GB and swap stays at 0. Initially, I did use the config recommendations from "neo4j-admin memrec" but those didn't seem to significantly change anything either in mem usage or overall execution time.
Watching disk is where there are differences, but I think these are symptoms and not the root cause: the local VM consistently writes 1-2 MB/s throughout the entire execution time (1-2 minutes). The GCP VM bursts writes of 300-400 KB/s for 1 second every 20-30 seconds. But I don't think the GCP disks are slow or the problem (I've tried with both GCP's standard disk and their SSD disk). If the GCP disks were slow, I would expect to see sustained write activity and a huge write-to-disk queue. It seems that whenever something should be written to disk, it gets done quickly in GCP. It seems the bottleneck is before the disk writes.
All I can think of is that my 10-year-old cores are way faster than a current GCP vCPU, or that there is some memory/heap issue going on. I don't know much about Java except that heaps are important and can be finicky.
Do you have the exact same :schema on both systems? If you're missing a critical index used in your LOAD query that could easily explain the differences you're seeing.
For example, if you're using a MATCH or a MERGE on a node by a certain property, it's the difference between doing a quick lookup of the node via the index, or performing a label scan of all nodes of that label checking every single one to see if the node exists or if it's the right node. Understand also that this process repeats for every single row, so in the worst case it's not a single label scan, it's n times that.
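As a quick check, list the indexes on both instances and create whatever is missing before re-running the load (Person and personId below are placeholder names; use the label and property your MERGE/MATCH keys on):
# list existing indexes on each instance (Neo4j 4.0)
cypher-shell -u neo4j -p '<password>' "CALL db.indexes();"

# create the missing index on the lookup property (placeholder names)
cypher-shell -u neo4j -p '<password>' "CREATE INDEX person_id_idx FOR (n:Person) ON (n.personId);"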

AWS Ubuntu Server filling up, but why?

I'm currently on a free-tier Amazon EC2 server with Ubuntu 16.04 installed. The main purpose is a web server serving HTML/PHP pages over HTTP, plus a MySQL database. The MySQL database is about 7 GB. It hasn't had data inserted for months, so I don't believe the database is at fault here.
Currently Amazon is telling me my storage is at 27GB. About two days ago it was at 25GB. I haven't even touched the server in about a month and I absolutely have not been installing anything. I'm trying to find out what is taking up all this data.
I installed ncdu and switched to root, ran it and these are the results:
As you can see, it's nowhere near 27 GB; it's just over 13 GB. So where is this other 14 GB of storage coming from? How do I find this out?
I'm afraid it's going to go over the 30GB free-tier limit and I don't know what will happen or how much I will be charged (I don't know how to find this out either).
You're misunderstanding the email. It has nothing to do with how much data you have stored.
It has to do with how much EBS storage you have used -- that is, how much you have provisioned over time. EBS doesn't bill based on what you store, it bills based on how big the disks are, regardless of what you put on them. Billing is in gigabyte-months.
A volume of size 1 gigabyte that exists for 30 days is said to "use" 1 gigabyte-month of EBS capacity.
1 gigabyte for 1 day is 1 GB × 1 day ÷ 30 days/mo ≈ 0.033 gigabyte-months.
30 gigabytes for 30 days uses 30 gigabyte-months of EBS capacity.
So 30 gigabytes for 27 days would be 27 gigabyte-months, and 30 gigabytes for 25 days would be 25 gigabyte-months.
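If you want to reproduce the numbers, the formula is just volume size in GB × days provisioned ÷ 30, for example:
# gigabyte-months = size_gb * days / 30 (e.g. a 30 GB volume kept for 27 days)
awk -v size_gb=30 -v days=27 'BEGIN { printf "%.1f gigabyte-months\n", size_gb * days / 30 }'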
Currently Amazon is telling me my storage is at 27GB. About two days ago it was at 25GB.
So this is normal, exactly what you'd expect if you had a single 30 GB EBS volume. Your usage over time is growing because time is passing, not because usage is increasing.
AWS can't see¹ what you've stored on the volume -- they have no idea how full it is... only how large it is, which isn't a number that changes unless you resize the volume to make it physically larger.
¹ can't see may seem implausible but is true for multiple reasons, including the simple fact that they simply don't look. An EBS volume is a block storage device. Although typically the volumes are used for standards-based filesystems, there's no constraint on that. They can be used in any other way that you can use a block device. But even when used as a filesystem, the concept of "free space" is a concept only the filesystem itself understands -- not the raw, underlying device.
The issue had to do with an Elastic IP being detached from a server while the server was down for a period of time.

AWS disk size downgrade

I have one AMI which was created from an instance with an m4.large and 1000 GB of disk space. Using that AMI I have spun up an instance, which by default comes up as an m4.large with a 1000 GB disk.
We are now thinking of downgrading the new instance to a t2.medium with 200 GB. Is that possible?
The first part, moving it to a t2.medium, is done, but we are stuck on reducing the disk size.
AWS doesn't provide a way to do this directly, but it is possible with some effort. This page outlines the process: https://cloudacademy.com/blog/amazon-ebs-shink-volume/
Essentially, you mount a smaller volume of the desired size to the same system and copy the files (or mirror a smaller partition), then you switch to using the smaller volume.
The exact process you use for doing that will vary depending on what operating system you're using, as well as whether or not the volume you're attempting to shrink is bootable.
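For a non-boot data volume on Linux, the copy step is roughly the following (device names and mount points are examples and will differ on your instance; a boot volume additionally needs the bootloader and fstab handled, as described in the linked article):
# create and attach a new, smaller (e.g. 200 GB) EBS volume, then:
sudo mkfs -t ext4 /dev/xvdg                  # format the new, smaller volume
sudo mkdir -p /mnt/new
sudo mount /dev/xvdg /mnt/new
# copy the data from the old volume, preserving permissions and hard links
sudo rsync -aHAX --info=progress2 /data/ /mnt/new/
# then update /etc/fstab to mount the new volume in place of the old one,
# unmount both, and detach the old 1000 GB volume in the EC2 console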