AWS ECS Task Out of Memory - CloudWatch Alarm

I have an ECS Service that uses multiple tasks to execute a daily job. The memory each task uses varies depending on the data it processes. I have allocated 16 GB of RAM to every task, but some tasks stopped with an "OutOfMemory" error.
Unfortunately, I can't break down the data that each task processes, because it has to be processed all together to produce the insights I want.
I know how to set up alarms for ECS Services for RAM and CPU, but the service-level RAM and CPU metrics are averages across all tasks.
How can I set an alarm that triggers when a single task runs out of memory?
Is there a suggested way to not encounter the OutOfMemory error?

I believe you have to enable ECS CloudWatch Container Insights to get per-task and per-container memory usage. Once you do that you will begin to see metrics for task memory usage (among other things) in CloudWatch that you can create alarms for.
Note that there is an added cost involved with enabling Container Insights.
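As a rough sketch of what such an alarm could look like, the function below only builds the keyword arguments for CloudWatch's PutMetricAlarm call. The cluster name, task family and alarm name are hypothetical. The namespace "ECS/ContainerInsights", the metric "MemoryUtilized" (in megabytes) and the ClusterName/TaskDefinitionFamily dimensions are what I recall Container Insights publishing, so check them against the metrics you actually see in your console. Using the Maximum statistic instead of the Average is the point here: it is meant to react to the hungriest task rather than the mean.

```python
def task_memory_alarm_params(cluster, family, task_memory_mb, percent=90):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    Assumption: Container Insights is enabled and publishes MemoryUtilized
    (megabytes) under the ECS/ContainerInsights namespace.
    """
    return {
        "AlarmName": f"{family}-task-memory-high",  # hypothetical name
        "Namespace": "ECS/ContainerInsights",       # assumed namespace
        "MetricName": "MemoryUtilized",             # assumed metric, in MB
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "TaskDefinitionFamily", "Value": family},
        ],
        "Statistic": "Maximum",   # highest reported value, not the average
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": task_memory_mb * percent / 100,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Hypothetical cluster and task family; 16384 MB matches the 16 GB tasks.
params = task_memory_alarm_params("my-cluster", "daily-job", 16384)
print(params["MetricName"], params["Threshold"])
```

With boto3 installed and credentials configured, the dictionary could be passed as boto3.client("cloudwatch").put_metric_alarm(**params).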
Is there a suggested way to not encounter the OutOfMemory error?
From an infrastructure perspective, all you can do is start provisioning more RAM for your tasks.
From an application perspective, you could analyze your application for memory leaks and examine the data structures it creates in memory for optimization opportunities, such as reducing duplicated data in memory, or moving some of the data to disk or to a distributed cache. This sort of memory optimization work is extremely application specific.

Related

ECS clarify on resources

I'm having trouble understanding the config definitions of a task.
I want to understand the resources. There are a few options (if we talk only about memory):
memory
containerDefinitions.memory
containerDefinitions.memoryReservation
There are a few things I'm not sure about.
First of all, the docs say that when the hard limit is exceeded, the container will stop running. Isn't the goal of a container orchestration service to keep the service alive?
Root-level memory must be greater than the combined memory of all containers. In theory, I would imagine that once there aren't enough containers deployed, new containers are created for the image. I wouldn't like to use more resources than I need, but if I reserve the memory at root level, first, I reserve much more than needed, and second, if my application receives a huge load, will the whole cluster shut down when the memory limit is exceeded, or what?
I want to implement a system that auto-scales, and I would imagine that this way I don't have to define resources allocated, it just uses the amount needed, and deploys/kills new containers if the load increases/decreases.
For me there is a lot of confusion around ECS and Fargate, how it works, and how it scales, and the more I read about it, the more confusing it gets.
I would like to set the minimum amount of resources per container, at how much load to create a new container, and at how much load to kill one (because it's not needed anymore).
P.S. I'm not experienced in DevOps in general. I used Kubernetes at my company, and there are things I'm not clear about; I'm just learning this ECS world.
First of all, the docs say that when the hard limit is exceeded, the container will stop running. Isn't the goal of a container orchestration service to keep the service alive?
I would say the goal of a container orchestration service is to deploy your containers, and restart them if they fail for some reason. A container orchestration service can't magically add RAM to a server as needed.
I want to implement a system that auto-scales, and I would imagine that this way I don't have to define resources allocated, it just uses the amount needed, and deploys/kills new containers if the load increases/decreases.
No, you always have to define the amount of RAM and CPU that you want to reserve for each of your Fargate tasks. Amazon charges you for the amount of RAM and CPU you reserve for your Fargate tasks, regardless of what your application actually uses, because Amazon has to allocate physical hardware resources to your ECS Fargate task to ensure that much RAM and CPU is always available to it.
Amazon can't add extra RAM or CPU to a running Fargate task just because it suddenly needs more. There will be other processes, of other AWS customers, running on the same physical server, and there is no guarantee that extra RAM or CPU are available on that server when you need it. That is why you have to allocate/reserve all the CPU and RAM resources your task will need at the time it is deployed.
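To make the three memory settings from the question concrete, here is a small sketch that checks a task definition for consistency. The task definition itself (family, container names, sizes) is made up. The checks encode two rules: the task-level memory has to cover the containers' limits, as stated in the question, and a container's soft limit (memoryReservation) should not exceed its hard limit (memory). Treat it as an illustration, not as the full set of validations ECS performs.

```python
def check_task_memory(task_def):
    """Return a list of problems in the memory settings (empty if none)."""
    problems = []
    task_mb = int(task_def["memory"])  # task-level ("root") memory, in MiB
    total = 0
    for container in task_def["containerDefinitions"]:
        hard = container.get("memory")             # hard limit: exceeded -> killed
        soft = container.get("memoryReservation")  # soft limit: reserved amount
        if hard is not None and soft is not None and soft > hard:
            problems.append(
                f"{container['name']}: memoryReservation {soft} exceeds memory {hard}"
            )
        # Count the hard limit if set, otherwise the soft reservation.
        total += hard if hard is not None else (soft or 0)
    if total > task_mb:
        problems.append(
            f"containers need {total} MiB but the task only has {task_mb} MiB"
        )
    return problems


# Hypothetical Fargate task definition: 2 vCPU, 16 GB at task level.
task_def = {
    "family": "daily-job",
    "cpu": "2048",
    "memory": "16384",
    "containerDefinitions": [
        {"name": "worker", "memory": 14336, "memoryReservation": 8192},
        {"name": "sidecar", "memory": 1024},
    ],
}
print(check_task_memory(task_def))
```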
You can configure autoscaling to trigger on the amount of RAM your tasks are using and start more instances of your task, spreading the load across more tasks, which should hopefully reduce the amount of RAM used by each individual task. You have to realize that each of the new Fargate task instances created by autoscaling may be spinning up on a different physical server, and each one reserves a specific amount of RAM on the server it runs on.
I would like to set the minimum amount of resources per container, at how much load to create a new container, and at how much load to kill one (because it's not needed anymore).
You need to allocate the maximum amount of resources all the containers in your task will need, not the minimum, because more physical resources can't be allocated to a single task at run time.
You would configure autoscaling with a target value of, for example, 60% RAM usage. It would automatically add more task instances if the average across the current instances exceeds 60%, and automatically start removing instances if that average is well below 60%.
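The 60% example could be expressed roughly like this. The function only builds the arguments for an Application Auto Scaling target tracking policy; the cluster name, service name, policy name and cooldown values are placeholders I picked for illustration. The service also has to be registered as a scalable target (with minimum and maximum task counts) before a policy like this can be attached.

```python
def memory_target_tracking_policy(cluster, service, target_percent=60.0):
    """Build keyword arguments for Application Auto Scaling's PutScalingPolicy."""
    return {
        "PolicyName": f"{service}-memory-target-tracking",  # hypothetical name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Keep the service's average memory utilization near this value.
            "TargetValue": target_percent,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
            },
            "ScaleOutCooldown": 60,   # seconds; illustrative values only
            "ScaleInCooldown": 300,
        },
    }


policy = memory_target_tracking_policy("my-cluster", "my-service")
print(policy["ResourceId"])
```

With boto3, this would be passed as boto3.client("application-autoscaling").put_scaling_policy(**policy).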

High Memory Utilization in AWS EC2

I am using an AWS EC2 t2.large server for my web application.
I have set up a custom metric for MemoryUtilization. After the setup, when I viewed the MemoryUtilization metric, it showed more than 85% almost all the time.
I have also checked the CPU utilization for the same instance; it is below 10% most of the time.
I am wondering how MemoryUtilization has gone so high. What are the possible options to reduce it? Is it due to the virtualization system of AWS?
Your application has memory leaks or unnecessary memory usage.
Try a memory leak detection tool to find and fix the problem in the application.
If you don't want to fix your application, try changing the instance type to a memory optimized instance type.
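Which leak detection tool fits depends on the language the application is written in, which the question doesn't say. Purely as an illustration, if it happened to be a Python application, the standard library's tracemalloc module can show which source lines allocated the most memory between two points in time. The leaky workload below is a made-up stand-in for real application code.

```python
import tracemalloc


def top_allocation_growth(workload, limit=3):
    """Run workload() and return the source lines whose memory grew the most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    workload()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() returns differences sorted with the largest changes first.
    return after.compare_to(before, "lineno")[:limit]


_cache = []  # stand-in for a structure that grows and is never cleared


def leaky_workload():
    for i in range(1000):
        _cache.append("x" * 1000 + str(i))


for stat in top_allocation_growth(leaky_workload):
    print(stat)
```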

Start a second AWS instance when the first reaches 85% memory or CPU usage

I have the following scenario:
I have two Windows servers on AWS that run an application via IIS. Because of particularities of the application, they use HTTP load balancing on IIS.
To reduce costs, I was asked to start the second instance only when the first one reaches 90% CPU usage or 85% memory usage.
In my region (sa-east-1), there are still no Auto Scaling Groups.
Initially, I created a CloudWatch event to start the second instance when it detected high CPU usage on the first. The problem is that CloudWatch still does not natively monitor memory, and so far I'm having trouble customizing this type of monitoring.
Is there any other way for me to start the second instance based on the above conditions?
Since the first instance is always running, could it be something at the Windows level, such as a PowerShell script that detects the high memory usage and starts the second instance? I already have the script to start instances via PowerShell; I just need help detecting the high memory usage event so I can start the second instance from it.
or some third-party application that does so...
Thanks!
Auto Scaling groups are available in sa-east-1, so use them.
Pick one metric upon which to scale (memory OR CPU). Do not pick both, otherwise it would be confusing how to scale when one metric is high and the other is low.
If you wish to monitor Windows memory in CloudWatch, see: Sending Logs, Events, and Performance Counters to Amazon CloudWatch - Amazon Elastic Compute Cloud
Also, be careful using a metric such as "memory usage" to measure the need to launch more instances. Some systems use garbage collection to free up memory, but only when available memory is low (rather than continuously).
Plus, make sure your application is capable of running across multiple instances, such as putting it behind a load balancer (depending on what the application actually does).
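One way to soften the garbage collection caveat is to react only to a sustained breach instead of a single high reading. The sketch below shows that decision for one metric. How the samples are collected (a Windows performance counter, a CloudWatch custom metric, and so on) is outside its scope, and the 85% threshold and three-sample window are simply the numbers from the question plus an arbitrary choice.

```python
def sustained_breach(samples, threshold=85.0, consecutive=3):
    """Return True if the last `consecutive` samples all exceed `threshold`.

    A single spike (for example just before a garbage collection) does not
    count; the metric has to stay high for several readings in a row.
    """
    if len(samples) < consecutive:
        return False
    return all(value > threshold for value in samples[-consecutive:])


# Hypothetical memory-usage readings in percent, oldest first.
print(sustained_breach([70, 91, 60, 88]))  # one spike, then recovery
print(sustained_breach([80, 86, 90, 92]))  # stays above the threshold
```

A CloudWatch alarm achieves the same effect with its evaluation periods setting, so a hand-written check like this is only needed if the decision has to be made on the instance itself.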

Monitoring workers or identifying bottlenecks in a data pipeline

I am using Google Cloud Dataflow. Some of my data pipelines need to be optimized. I need to understand how workers are performing in the Dataflow cluster along these lines:
1. How much memory is being used?
Currently I am logging memory usage using Java code.
2. Is there a bottleneck in disk operations? This is to understand whether an SSD is required.
3. Is there a bottleneck in vCPUs? This is to decide whether to increase the vCPUs in worker nodes.
I know Stackdriver can be used to monitor CPU and disk usage for the cluster. However, it does not provide information on individual workers, or on whether we are hitting a bottleneck in these.
Within the Dataflow Stackdriver UI, you are correct, you cannot view an individual worker's metrics. However, you can certainly set up a Stackdriver Dashboard which gives you the individual worker metrics for everything you have mentioned. For example, a dashboard can chart CPU, memory, network, read IOPS, and write IOPS per worker.
Since the Dataflow job name will be part of the GCE instance name, you can filter the monitored GCE instances by the job name you're interested in. In my case, the Dataflow job was named "pubsub-to-bigquery", so I filtered down to instance_name ~= pubsub-to-bigquery.*. I used a regex filter to be sure I captured any job names which may be suffixed with additional data in future runs. Setting up a dashboard such as this can inform you when you'd actually benefit from SSDs, more network bandwidth, etc.
Also be sure to check the Dataflow job graph in the Cloud Console when looking to optimize your pipeline. The wall time below the step name can give a good indication of which custom transforms or DoFns should be targeted for optimization.
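To illustrate what the regex filter selects, here is the same pattern applied to a few invented instance names with Python's re module. The names are hypothetical and only mimic the idea that the job name prefixes the worker's instance name. Matching from the start of the name (re.match) is my assumption about how the dashboard filter behaves.

```python
import re

# Same pattern as in the dashboard filter above.
pattern = re.compile(r"pubsub-to-bigquery.*")

# Hypothetical GCE instance names; real Dataflow worker names will differ.
instances = [
    "pubsub-to-bigquery-07161405-ab12-harness-x1yz",
    "pubsub-to-bigquery-v2-07161410-cd34-harness-q9rt",
    "gcs-to-bigquery-07161200-ef56-harness-k2lm",
]

# Keep only the workers whose name starts with the job name.
workers = [name for name in instances if pattern.match(name)]
for name in workers:
    print(name)
```

The trailing .* is what lets a later run with a suffixed job name (the second entry) still match.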

How to get consistent CPU utilization on AWS

I've now learnt that when I start a new EC2 instance it has a certain number of CPU credits, due to which its performance is high when it starts processing but gradually drops as the credits run out. Past that point, the instance runs at what appears to be the baseline CPU utilisation rate. To illustrate, when I started the EC2 instance (t2.nano), CloudWatch reported around 80% CPU utilisation, gradually decreasing to 5%.
Now I'm happy to use one of the better instance types pending the instance limit request. But whilst that is in progress, I'd like to know whether the issue of reducing performance over time will still hold even with the better instance type?
Would I require a dedicated host setup if I wish to ensure I get consistent CPU utilisation? The only problem I can see here is that I'm running an SQS worker queue, and Elastic Beanstalk allows us to easily set up a worker environment which reads messages from the queue. From what I've read and from looking at the configuration options available in Elastic Beanstalk, I don't think I'll be able to launch instances into a dedicated host directly. Most of my reading has led me to believe that I'll have to learn how to use a VPC. Would that be correct?
So I guess my questions are: would simply moving to a more powerful instance type guarantee consistent CPU performance, or is a dedicated host required? If so, is it possible to set one up with Elastic Beanstalk, or would it have to be set up manually? And if it is set up manually, can it be configured to work with an SQS queue automatically?
If you want consistent CPU performance, you should avoid the burstable performance instances (the T2 family). All other families of instances (M5, C5, etc) will have consistent CPU performance over time. You can use any instance family with Elastic Beanstalk. No need for a dedicated host.
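The drop from 80% to 5% described in the question follows from simple credit arithmetic, sketched below. One CPU credit corresponds to one vCPU running at 100% for one minute. The t2.nano figures used here (30 launch credits, 3 credits earned per hour, which works out to a 5% baseline) are the values as I remember them from the T2 documentation, so treat them as assumptions and check the current instance type tables.

```python
def minutes_until_credits_run_out(balance, utilization, earn_per_hour):
    """Estimate how long a burstable instance can sustain `utilization`.

    balance       -- CPU credits currently available
    utilization   -- CPU usage of one vCPU as a fraction (0.8 means 80%)
    earn_per_hour -- credits the instance type earns per hour
    Returns None when usage is at or below the rate at which credits are earned.
    """
    spend_per_minute = utilization      # 1 credit = 1 vCPU-minute at 100%
    earn_per_minute = earn_per_hour / 60
    net_spend = spend_per_minute - earn_per_minute
    if net_spend <= 0:
        return None  # credits never run out at this load
    return balance / net_spend


# Assumed t2.nano figures: 30 launch credits, 3 credits earned per hour.
minutes = minutes_until_credits_run_out(30, 0.80, 3)
print(f"roughly {minutes:.0f} minutes at 80% before falling back to baseline")
```

Fixed-performance families such as M5 or C5 have no credit balance at all, which is why this calculation does not apply to them.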