What is reserved memory in YARN and why does it spike?

I have a question about what exactly reserved memory in YARN is.
I understand that it's YARN's way of balancing the memory requirements of the several jobs that are submitted, so that no job is starved: it reserves memory for a job as and when memory is freed up.
We use AWS EMR for our operations. I have observed that at times, when a memory-intensive job (say, a Spark SQL job) is submitted on our cluster, 1 TB of RAM gets reserved out of our total 3 TB. I have seen this even when just one job is submitted/running on the cluster and no others are waiting or queued.
This spike is intermittent, lasts somewhere in the range of 5-15 minutes, and then comes down to a manageable level or even 0.
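(For reference, the reserved figure described above is the reservedMB value exposed by the YARN ResourceManager cluster metrics REST API; a minimal way to watch it, assuming the default ResourceManager port 8088, with the master node address as a placeholder:)
# poll YARN's cluster metrics every 30 s and print the reserved/allocated/available MB fields
while true; do
  curl -s -H "Accept: application/json" "http://<emr-master-dns>:8088/ws/v1/cluster/metrics" \
    | python -c 'import json,sys; m=json.load(sys.stdin)["clusterMetrics"]; print("reservedMB=%s allocatedMB=%s availableMB=%s" % (m["reservedMB"], m["allocatedMB"], m["availableMB"]))'
  sleep 30
done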
Can someone please explain whether this is normal behavior or not? If it's normal, please explain it in a little detail. If there is a probable configuration mistake that can trigger this, please help me resolve it.
Note: we have a 10-node r4.8xlarge cluster on EMR 5.0.0.
Thanks in advance
Manish Mehra

Related

Memory limit exceeded error while the container memory utilization is below 55%

I'm getting the following error in a service that runs on Cloud Run:
Memory limit of 256M exceeded with 432M used
I don't understand why the service got this error, because from what I saw in the Container memory utilization graph, the service only uses around 50% of the total available memory.
The container memory utilization graph doesn't show real-time values and doesn't capture memory utilization continuously. It samples the values from time to time (I don't know the exact frequency, probably something like every 30 s) and then shows you the average/typical memory utilization.
However, if you have a peak in your usage (a big file, a special request, ...) that crashes the service with an OOM within a few milliseconds, the monitoring system doesn't have time to sample the container memory utilization and show it to you.
You should investigate the request that causes this OOM issue. You can also describe your service or share a piece of code if you need help.
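(If the spikes turn out to be legitimate, one simple mitigation is to raise the memory limit of the service; a sketch, where the service name, limit and region are placeholders:)
# give the Cloud Run service a larger memory limit
gcloud run services update my-service --memory 512Mi --region europe-west1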

Seeing RDS Deadlock errors, could this be tied to IOPS limit

I'm seeing some errors on our AWS RDS MySQL server:
General error: 1205 Lock wait timeout exceeded; try restarting transaction
Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction
I looked at the RDS console monitoring tab, and read IOPS seems to be capped (first screenshot), perhaps indicating that the disk IO is not keeping up with the requests. The funny thing is that write IOPS does not seem to be capped (second screenshot). In general very few app server requests fail due to the database error, but I would like to get this sorted.
CPU load on the RDS server peaks around 50%. This makes me think the db.t3.small RDS size is sufficient.
The database is tiny, just 20 GB, and was created some years ago, so it's on magnetic storage. I have read that this means there's a limit of 200 IOPS, which matches the approx 150 + 50 IOPS peaks seen. I am therefore thinking about moving to General Purpose SSD. According to the docs, for a database this small that will only provide 100 IOPS as baseline performance, but a burst of 3,000 IOPS is possible.
Does this sound like a good move, and any other suggestions on what to do?
Update: I have been running with General Purpose SSD for a couple of days now. The MySQL deadlock errors have not been seen since, so in case someone else finds this question: changing from Magnetic to General Purpose SSD in RDS is certainly something to try if you have similar problems.
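(For reference, the storage-type change itself can be done with the AWS CLI; the instance identifier below is a placeholder, and --apply-immediately makes the change outside the maintenance window:)
# switch the instance from magnetic (standard) storage to General Purpose SSD (gp2)
aws rds modify-db-instance \
    --db-instance-identifier my-database \
    --storage-type gp2 \
    --apply-immediately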

AWS ElasticSearchService index_create_block_exception

I'm trying to create a new index in an AWS Elasticsearch cluster after increasing the cluster size and I'm seeing index_create_block_exception. How can I rectify this? I tried searching but did not find exact answers. Thank you.
curl -XPUT 'http://<aws_es_endpoint>/optimus/'
{"error":{"root_cause":[{"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"}],"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"},"status":403}
According to AWS, the above exception is thrown when the nodes are running low on memory:
For t2 instances, when the JVMMemoryPressure metric exceeds 92%,
Amazon ES triggers a protection mechanism by blocking all write
operations to prevent the cluster from getting into red status. When
the protection is on, write operations will fail with a
ClusterBlockException error, new indexes cannot be created, and the
IndexCreateBlockException error will be thrown.
I'm afraid the issue is still ongoing.
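(For reference, the JVMMemoryPressure metric quoted above can be checked from CloudWatch; the domain name, account id and time range below are placeholders:)
# maximum JVM memory pressure for the domain, in 5-minute buckets
aws cloudwatch get-metric-statistics \
    --namespace AWS/ES \
    --metric-name JVMMemoryPressure \
    --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
    --start-time 2016-09-01T00:00:00Z --end-time 2016-09-01T06:00:00Z \
    --period 300 --statistics Maximum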
You'll also get this error if you run out of disk space.
This should of course not happen after increasing the cluster size, but if you suddenly get this error it'd be worth checking that all your instances have storage left, i.e. don't look at the Total free storage space graph but at the Minimum free storage space one.
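(A quick way to see per-node disk usage is the _cat/allocation API, reusing the same endpoint placeholder as above:)
# shows shard count, disk used and disk available for every data node
curl -XGET 'http://<aws_es_endpoint>/_cat/allocation?v'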

Elasticsearch percolation dead slow on AWS EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :(
We use Elasticsearch 2.2.0.
To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60GB, 2 x 160 GB SSD) and tell our index to include only this node in allocation.
Because we'll distribute it amongst the rest of the nodes later, we use 10 shards, no replicas (just for indexing and percolation).
There are about 22 million documents in the index and 15,000 percolators. The index is a tad smaller than 11 GB (and so easily fits into memory).
About 16 PHP processes talk to the REST API doing multi percolate requests with 200 percolations each (we made it smaller because of performance; it was 1,000 per request before).
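(For reference, a multi percolate request of this shape looks roughly like the following against the ES 2.x _mpercolate API; the index and type names are made up, localhost:9200 stands in for whichever node you query, and the real batches carry around 200 header/doc pairs.)
# requests: alternating percolate headers and documents, newline-delimited
{"percolate": {"index": "my_index", "type": "my_type"}}
{"doc": {"message": "first document to percolate"}}
{"percolate": {"index": "my_index", "type": "my_type"}}
{"doc": {"message": "second document to percolate"}}
# send the whole batch in one call
curl -XGET 'http://localhost:9200/_mpercolate' --data-binary @requests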
One percolation request (a real one, captured from the running PHP processes) takes around 2m20s under load (from the 16 PHP processes). That would have been OK if one of the resources on the EC2 instance were maxed out, but that's the strange thing (see the stats output here, but also seen in htop, iotop and iostat): load, CPU, memory, heap, IO; everything is well (very well) within limits. There doesn't seem to be a shortage of resources, but still, percolation performance is bad.
When we back off the PHP processes and try the percolate request again, it comes out at around 15s. Just to be clear: I don't have a problem with a 2 min+ multi percolate request, as long as I know that one of the resources is fully utilized (so I can act on it by giving it more of what it wants).
So, ok, it's not the usual suspects, let's try different stuff:
To rule out network, coordination, etc. issues we also did the same request from the node itself (enabling the client) with the same pressure from the PHP processes: no change.
We upped the processors configuration in elasticsearch.yml and restarted the node to fake our way to a higher usage of resources: no change.
We tried tweaking the percolate and get pool size and queue size: no change.
When we looked at the hot threads, we discovered UsageTrackingQueryCachingPolicy was coming up a lot, so we did as suggested in this issue: no change.
Maybe it's the number of replicas, seeing that Elasticsearch uses those for searches as well? We upped it to 3 and used more EC2 instances to spread them out: no change.
To determine whether we could actually use all resources on EC2, we did stress tests and everything seemed fine, getting the load to over 40. IO, memory, etc. also showed no issues under high strain.
It could still be the batch size. Under load we tried a batch of just one percolator in a multi percolate request, directly on the data & client node (dedicated to this index), and found that it took 1m50s. When we tried a batch of 200 percolators (still in one multi percolate request) it took 2m02s; the ~12s difference fits roughly with the 15s result from earlier, without pressure.
This last point might be interesting! It seems that it's stuck somewhere for a loooong time and then goes through the percolate phase quite smoothly.
Can anyone make anything out of this? Anything we have missed? We can provide more data if needed.
Have a look at the thread on the Elastic Discuss forum to see the solution.
TL;DR:
Use multiple nodes on one big server to get better resource utilization.
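(For anyone landing here: on ES 2.x, running several nodes on the same machine can be as simple as starting multiple instances with separate data paths and HTTP ports; the node names and paths below are made up.)
# start two Elasticsearch 2.x nodes on the same box, each with its own data path and HTTP port
bin/elasticsearch -d -Des.node.name=percolate-1 -Des.path.data=/data/es/node1 -Des.http.port=9200
bin/elasticsearch -d -Des.node.name=percolate-2 -Des.path.data=/data/es/node2 -Des.http.port=9201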

Amazon Elasticache Failover

We have been using AWS ElastiCache for about 6 months now without any issues. Every night a Java app runs which flushes DB 0 of our Redis cache and then repopulates it with updated data. However, on three occasions between July 31 and August 5, our DB was successfully flushed and then we were not able to write the new data to it.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue, and since the app had been running fine for the past 6 months, I am wondering whether it is related to the recent ElastiCache release from the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it fails, other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis commands, including the flush, are synchronous, so the flush blocks all other requests. In our case we are actually flushing 19 million keys, and it takes more than 30 seconds.
ElastiCache performs a health check periodically, and since the flush is running, the health check is blocked, which triggers a failover.
We have been asking the support team how often the health check is performed, so we can get an idea of why our flush only causes a failover 3-4 times a week. The best answer we can get is "We think it's every 30 seconds". However, our flush consistently takes more than 30 seconds yet doesn't consistently fail.
They said they may implement the ability to configure the timing of the health check, but that this would not be done anytime soon.
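(To get a feel for how long the flush actually takes from a client's point of view, something like the following works; the endpoint is a placeholder:)
# time a flush of DB 0 against the primary endpoint
time redis-cli -h <elasticache-primary-endpoint> -n 0 FLUSHDB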
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an update version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
Hope this helps :)
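(Regarding option 3 above, a rough sketch of expiring keys with some random jitter using redis-cli; the endpoint and key name are placeholders, and a real loader would set the TTL as it writes each key.)
# give a key a ~24h TTL plus up to 1h of random jitter instead of flushing it later
TTL=$((86400 + RANDOM % 3600))
redis-cli -h <elasticache-primary-endpoint> -n 0 EXPIRE mydata:12345 $TTL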
As of Redis version 6.2, AWS ElastiCache monitors the engine from a separate thread, so the health check no longer runs in the same thread as the other Redis operations. Redis can keep processing a long command or Lua script and still be considered healthy. Thanks to this new feature, failovers of this kind should happen less often.