In AWS Batch, when I specify a memory requirement of e.g. 32000MB, my job ends up getting killed because (a) the actual instance autoselected has 64GB memory and (b) ECS seems to view 32000MB as both a requirement and a hard limit ("If your container attempts to exceed the memory specified here, the container is killed" from https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html). So as soon as my job goes slightly above 32GB, it gets killed, even though I am happy for it to use up to the 64GB.
How do I properly specify a minimum memory requirement without causing AWS Batch to kill jobs that go slightly above that? It seems very strange to me that the "memory" parameter appears to be both a minimum and a maximum.
I assume I'm misunderstanding something.
The memory requirement in the resourceRequirements property is always the maximum/upper bound: there you specify the most memory your job container is allowed to use.
Quote from https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html :
The hard limit (in MiB) of memory to present to the container. If your container attempts to exceed the memory specified here, the container is killed.
A lower/minimum bound would not make much sense, since AWS needs to put your job container onto a host that can actually support the upper bound/limit, and there is no way to tell a priori how much memory your container is actually going to use.
Or put another way: If there were such a thing as a "minimum" requirement and you specified minimum = 1 MiB and maximum = 16 GiB, what is AWS Batch supposed to do with this information? It cannot put your job container onto a host with say 512 MiB of memory because your job container, as it runs, may exceed that, since you said the maximum was 16 GiB (in this example). And AWS Batch is not going to freeze a running job and migrate it onto another host, once the current host's memory is reached.
The fact that AWS Batch put your particular job container onto an instance with 64 GiB is probably because 32 GiB sits right on the boundary between instance memory sizes (32 GiB vs. 64 GiB), and if your job used the full 32 GiB there would be no memory left for the host itself (without swapping).
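Concretely, the fix is to request the full amount the job may ever need rather than the expected minimum. Here is a minimal boto3 sketch; the job definition name, image URI, and exact values are placeholders, and the MEMORY value is deliberately kept a little below 64 GiB to leave headroom for the ECS agent and the OS.

```python
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="my-memory-hungry-job",  # placeholder name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            # Ask for ~60 GiB instead of 32 GiB: this both steers the job onto a
            # 64 GiB instance and raises the hard limit, so the container is not
            # killed as soon as it goes past 32 GiB.
            {"type": "MEMORY", "value": "61440"},
        ],
    },
)
```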
Related
In AWS Lambda, is the RAM allocated to a Lambda function per running instance of that function, or shared across all running instances of that function? Until now I believed it was per instance.
Let's consider a Lambda function 'testlambda' configured with a 5 minute timeout, 3008 MB (the current maximum) of RAM, and the "Use unreserved account concurrency" option selected.
At time T, one instance of 'testlambda' starts running; assume it runs for 100 seconds and uses 100 MB of RAM for the whole 100 seconds. If another instance of 'testlambda' starts at T+50s, how much RAM will be available to the second instance: 3008 MB or 2908 MB?
I used to believe that the second instance would also have 3008 MB, but after looking at the recent execution logs of my Lambda I am inclined to say that the second instance will have 2908 MB.
The allocation is for each container.
Containers are not used by more than one invocation at any given time -- that is, containers are reused, but not concurrently. (And not by more than one version of one function).
If your code is leaking memory, this means subsequent but non-concurrent invocations spaced relatively close together in time will be observed as using more and more memory because they are running in the same container... but this would never happen in the scenario you described, because with the second invocation at T+50, it would never share the container with the 100-second process started at T+0.
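If you want to see this container reuse for yourself, a hypothetical handler like the one below (all names are made up) makes it visible: module-level state survives only within a single container, and concurrent invocations land in separate containers, each with the full memory allocation.

```python
import os
import time

INVOCATION_COUNT = 0                 # survives only for the lifetime of this container
CONTAINER_ID = os.urandom(4).hex()   # pseudo-ID assigned once per container

def lambda_handler(event, context):
    global INVOCATION_COUNT
    INVOCATION_COUNT += 1
    # Concurrent invocations print different container IDs; closely spaced
    # sequential invocations tend to reuse the same one.
    print(f"container={CONTAINER_ID} invocation_in_this_container={INVOCATION_COUNT}")
    time.sleep(event.get("sleep_seconds", 0))   # simulate the long-running work
    return {
        "container": CONTAINER_ID,
        "count": INVOCATION_COUNT,
        "memory_limit_mb": context.memory_limit_in_mb,
    }
```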
From what I've seen, at least so far, the RAM is not shared. We had a lot of concurrent requests with the default RAM for Lambdas; if the memory were shared for some reason, we would have seen memory-related problems, but that never happened.
You could test this yourself by reducing the RAM of a dummy Lambda that executes for X seconds and calling it several times concurrently, to see whether the memory used ever exceeds the amount you selected (see the sketch below).
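A hypothetical version of that dummy Lambda could look like this: it allocates a fixed chunk of memory, holds it for a while, and reports the per-container limit. If memory were shared across concurrent instances, overlapping invocations would run out of memory even though each one stays well under the configured size.

```python
import time

def lambda_handler(event, context):
    mb_to_allocate = event.get("mb", 50)      # how much memory to grab (MiB)
    hold_seconds = event.get("seconds", 30)   # how long to keep it resident

    blob = bytearray(mb_to_allocate * 1024 * 1024)  # allocate and hold N MiB
    time.sleep(hold_seconds)

    return {
        "allocated_mb": mb_to_allocate,
        "memory_limit_mb": context.memory_limit_in_mb,  # the limit is per container
    }
```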
My Lambda function was taking about 120 ms with a 1024 MB memory size. When I checked the log, it was using only 22 MB at most, so I tried optimizing it by reducing the memory to 128 MB.
But when I did this, the ~120 ms of processing went up to about ~350 ms, while still only 22 MB was being used.
I'm a bit confused: if I only use 22 MB, why does having 128 MB or 1024 MB available impact the processing time?
The underlying CPU power is directly proportional to the memory footprint that you select. So basically that memory knob controls your CPU allocation as well.
That is why you are seeing that reducing the memory causes your Lambda to take more time to execute.
The following is what is documented in the AWS docs for Lambda:
Compute resources that you need – You only specify the amount of memory you want to allocate for your Lambda function. AWS Lambda allocates CPU power proportional to the memory by using the same ratio as a general purpose Amazon EC2 instance type, such as an M3 type. For example, if you allocate 256 MB memory, your Lambda function will receive twice the CPU share than if you allocated only 128 MB.
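One way to see this for yourself is to deploy a hypothetical CPU-bound handler like the one below and change only the memory setting between runs (for example with `aws lambda update-function-configuration --function-name <name> --memory-size 1024`). The work never needs more than a few MB of memory, yet it finishes noticeably faster at higher memory settings because more CPU comes with the memory.

```python
import time

def lambda_handler(event, context):
    start = time.perf_counter()
    total = 0
    for i in range(5_000_000):       # pure CPU work, tiny memory footprint
        total += i * i
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {
        "elapsed_ms": round(elapsed_ms, 1),
        "memory_limit_mb": context.memory_limit_in_mb,   # configured memory size
    }
```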
I'm working on optimizing the performance of my Spark cluster (run on AWS EMR) that is performing Collaborative Filtering using the ALS matrix factorization algorithm. We are using quite a few factors and iterations so I'm trying to optimize these steps in particular. I am trying to understand why I am using disk space when I have plenty of memory available. Here is the total cluster memory available:
Here is the remaining disk space (notice the dips of disk utilization):
I've looked at the YARN resource manager, and it shows that each slave node has 110 GB used and 4 GB available. You can also see the total allocated in the first image (700 GB). I've also tried changing the ALS source and forcing intermediateRDDStorageLevel and finalRDDStorageLevel from MEMORY_AND_DISK to MEMORY_ONLY, and that didn't affect anything.
I am not persisting my RDD's anywhere else in my code so I'm not sure where this disk utilization is coming from. I'd like to better utilize the resources on my cluster, any ideas? How can I more effectively use the available memory?
There are a few scenarios where Spark will use disk instead of memory:
Shuffle operations. Spark writes shuffled data to disk only, so if your job has shuffle operations you are out of luck there.
Low executor memory. If executor memory is low, Spark has less room to keep data in memory and will spill it to disk. However, as you mentioned, you have already tried executor memory from 20 G to 40 G. I would recommend keeping executor memory at or below 40 G, since beyond that JVM GC can make your process slower.
If you don't have shuffle operations, you might as well tweak spark.memory.fraction (if you are using Spark 2.2); see the sketch further down.
From the documentation:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300 MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
So you can set spark.memory.fraction to 0.9 and see how it behaves.
Lastly, there are storage levels other than MEMORY_ONLY, such as MEMORY_ONLY_SER, which serializes the data and stores it in memory. This option reduces memory usage because the serialized object size is much smaller than the actual object size. If you see a lot of spill, you can opt for this storage level.
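Here is a rough PySpark sketch of those knobs, assuming you are on the DataFrame-based pyspark.ml ALS (which exposes the intermediate and final storage levels as parameters, so you don't need to patch the ALS source). The input path, column names, and exact values are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = (
    SparkSession.builder
    .appName("als-tuning")
    .config("spark.memory.fraction", "0.9")   # default is 0.6
    .config("spark.executor.memory", "40g")   # staying <= ~40g keeps GC pauses manageable
    .getOrCreate()
)

ratings = spark.read.parquet("s3://my-bucket/ratings/")   # placeholder input

als = ALS(
    userCol="userId", itemCol="itemId", ratingCol="rating",
    rank=100, maxIter=20,
    # Storage levels are passed as strings; "MEMORY_ONLY_SER" should also be
    # accepted if you want serialized in-memory storage. Note that shuffle data
    # produced during factorization still goes to disk regardless of these settings.
    intermediateStorageLevel="MEMORY_ONLY",
    finalStorageLevel="MEMORY_ONLY",
)
model = als.fit(ratings)
```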
I'm confused about the purpose of having both hard and soft memory limits for ECS task definitions.
IIRC the soft limit is how much memory the scheduler reserves on an instance for the task to run, and the hard limit is how much memory a container can use before it is murdered.
My issue is that if the ECS scheduler allocates tasks to instances based on the soft limit, you could have a situation where a task that is using memory above the soft limit but below the hard limit could cause the instance to exceed its max memory (assuming all other tasks are using memory slightly below or equal to their soft limit).
Is this correct?
Thanks
If you expect to run a compute workload that is primarily memory bound instead of CPU bound then you should use only the hard limit, not the soft limit. From the docs:
You must specify a non-zero integer for one or both of memory or memoryReservation in container definitions. If you specify both, memory must be greater than memoryReservation. If you specify memoryReservation, then that value is subtracted from the available memory resources for the container instance on which the container is placed; otherwise, the value of memory is used.
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
By specifying only a hard memory limit for your tasks you avoid running out of memory because ECS stops placing tasks on the instance, and docker kills any containers that try to go over the hard limit.
The soft memory limit feature is designed for CPU bound applications where you want to reserve a small minimum of memory (the soft limit) but allow occasional bursts up to the hard limit. In this type of CPU heavy workload you don't really care about the specific value of memory usage for the containers that much because the containers will run out of CPU long before they exhaust the memory of the instance, so you can place tasks based on CPU reservation and the soft memory limit. In this setup the hard limit is just a failsafe in case something goes out of control or there is a memory leak.
So in summary you should evaluate your workload using load tests and see whether it tends to run out of CPU first or out of memory first. If you are CPU bound then you can use the soft memory limit with an optional hard limit just as a failsafe. If you are memory bound then you will need to use just the hard limit with no soft limit.
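As a concrete illustration of the two setups, here is a hedged boto3 sketch; the family names, images, and values are placeholders.

```python
import boto3

ecs = boto3.client("ecs")

# Memory-bound workload: hard limit only. ECS subtracts the full 2048 MiB when
# placing the task, and Docker kills the container if it exceeds that, so the
# instance's memory can never be oversubscribed.
ecs.register_task_definition(
    family="memory-bound-service",
    containerDefinitions=[{
        "name": "app",
        "image": "my-registry/app:latest",
        "memory": 2048,                # hard limit (MiB)
    }],
)

# CPU-bound workload: small soft reservation for placement plus a hard limit
# as a failsafe against leaks or runaway memory use.
ecs.register_task_definition(
    family="cpu-bound-service",
    containerDefinitions=[{
        "name": "worker",
        "image": "my-registry/worker:latest",
        "cpu": 512,                    # CPU units drive placement here
        "memoryReservation": 256,      # soft limit (MiB)
        "memory": 1024,                # hard limit / failsafe (MiB)
    }],
)
```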
#nathanpeck is the authority here, but I just wanted to address a specific scenario that you brought up:
My issue is that if the ECS scheduler allocates tasks to instances based on the soft limit, you could have a situation where a task that is using memory above the soft limit but below the hard limit could cause the instance to exceed its max memory (assuming all other tasks are using memory slightly below or equal to their soft limit).
This post from AWS explains what occurs in such a scenario:
If containers try to consume memory between these two values (or between the soft limit and the host capacity if a hard limit is not set), they may compete with each other. In this case, what happens depends on the heuristics used by the Linux kernel's OOM (Out of Memory) killer. ECS and Docker are both uninvolved here; it's the Linux kernel reacting to memory pressure. If something is above its soft limit, it's more likely to be killed than something below its soft limit, but figuring out which process gets killed requires knowing all the other processes on the system and what they are doing with their memory as well. Again the new memory feature we announced can come to rescue here. While the OOM behavior isn't changing, now containers can be configured to swap out to disk in a memory pressure scenario. This can potentially alleviate the need for the OOM killer to kick in (if containers are configured to swap).
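For completeness, opting a container into swap is done through the linuxParameters block of the container definition (EC2 launch type only, and the container instance must have swap enabled). A hedged sketch, with placeholder names and illustrative values:

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="bursty-memory-service",
    containerDefinitions=[{
        "name": "app",
        "image": "my-registry/app:latest",
        "memoryReservation": 512,    # soft limit (MiB)
        "memory": 1024,              # hard limit (MiB)
        "linuxParameters": {
            "maxSwap": 1024,         # MiB of swap the container may use
            "swappiness": 60,        # 0 = avoid swapping, 100 = swap aggressively
        },
    }],
)
```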
I have an Amazon RDS instance. Freeable Memory has been declining since setup over 1-2 weeks, starting from 15 GB of memory down to about 250 MB. As it has dropped this low in the last days, it has started to resemble a sawtooth pattern where Freeable Memory drops to this range (250 - 350 MB) and then goes back up again to 500 - 600 MB.
There has not been any notable decline in application quality. However, I am worried that the DB will run out of memory and crash.
Is there a danger that the RDS instance will run out of memory? Is there some setting or parameter I should be looking at to determine if the instance is set up correctly? What is causing this sawtooth pattern?
Short answer: you shouldn't worry about FreeableMemory unless it becomes really low (about 100-200 MB) or significant swapping occurs (see the RDS SwapUsage metric).
FreeableMemory is not a MySQL metric but an OS metric. It is hard to give a precise definition, but you can treat it as the memory the OS is able to allocate to anyone who requests it (in your case that will most likely be MySQL). MySQL has a set of settings that restrict its overall memory usage to some cap (you can use something like this to actually calculate it). It's unlikely that your instance will ever hit this limit, because in general you never reach the maximum number of connections, but it is still possible.
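The cap referred to above is roughly the sum of the global buffers plus the per-connection buffers multiplied by max_connections. A back-of-the-envelope sketch (the parameter names are standard MySQL server variables, the values are just examples, and the per-connection figure is a rough lump sum for the sort/read/join buffers and thread stack):

```python
def mysql_max_memory_mb(
    innodb_buffer_pool_size_mb=12288,   # ~75% of a 16 GB host, the RDS default
    key_buffer_size_mb=16,
    query_cache_size_mb=0,
    max_connections=151,
    per_connection_mb=3,                # sort/read/join buffers + thread stack, roughly
):
    global_buffers = innodb_buffer_pool_size_mb + key_buffer_size_mb + query_cache_size_mb
    return global_buffers + max_connections * per_connection_mb

print(mysql_max_memory_mb())   # ~12757 MB for this example
```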
Now, going back to the "decline" in the FreeableMemory metric. In MySQL, most of the memory is consumed by the InnoDB buffer pool (see here for details). RDS instances, in their default configuration, have the size of this buffer set to 75% of the host's physical memory, which in your case is about 12 GB. This buffer is used for caching all DB data involved in both read and write operations.
So in your case, since this buffer is really big, it is slowly filling up with cached data (it is likely that this buffer is actually big enough to cache the entire DB). When you first start your instance the buffer is empty, and as you read from and write to the DB, that data is brought into the cache. It stays there until the cache becomes full and a new request comes in, at which point the least recently used data is replaced with new data. This explains the initial decline of FreeableMemory after a DB instance restart. It is not a bad thing, because you actually want as much of your data as possible to be cached so that your DB works faster. The only thing that can go nasty is when part or all of this buffer gets pushed out of physical memory into swap; at that point you will see a huge performance drop.
As preventive care, it might be a good idea to tune the maximum memory MySQL uses for various things if your FreeableMemory metric is constantly at a level of 100-200 MB, just to reduce the possibility of swapping.
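If you go down that route, the buffer pool size is set through a custom parameter group (the default group cannot be modified). A hedged boto3 sketch, where the group name is a placeholder and the formula simply lowers the default {DBInstanceClassMemory*3/4} to leave the OS a bit more headroom:

```python
import boto3

rds = boto3.client("rds")

rds.modify_db_parameter_group(
    DBParameterGroupName="my-custom-mysql-params",   # placeholder custom group
    Parameters=[{
        "ParameterName": "innodb_buffer_pool_size",
        "ParameterValue": "{DBInstanceClassMemory*2/3}",   # down from the 3/4 default
        "ApplyMethod": "pending-reboot",                   # applied at the next reboot
    }],
)
```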
Freeable memory is used by MySQL for buffering and caching for its own processes. It is normal for the amount of freeable memory to decrease over time. I wouldn't be worried; it kicks old info out as it needs more room.
After several support tickets with AWS, I found that tuning the parameter groups can help, especially the shared buffers, lowering them to keep a reserved quantity of memory and avoid drops or failovers due to lack of memory.