How to increase ulimit on AWS EMR with AutoScaling, dynamically? - amazon-web-services

I have a Spark application in Java running on AWS EMR. I have implemented an autoscaling policy based on the available YARN memory. For jobs that require more memory, EMR scales the cluster up to 1+8 nodes.
At some point during the job I start getting the error below, and it repeats for hours until I terminate the cluster manually.
java.io.IOException: All datanodes [DatanodeInfoWithStorage[<i.p>:50010,DS-4e7690c7-5946-49c5-b203-b5166c2ff58d,DISK]] are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1531)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
This error occurs on the very first worker node that was spawned. After some digging, I found out this might be because of the ulimit. Increasing the ulimit is easy to do manually on any Linux or EC2 machine, but I cannot figure out how to do it dynamically for every EMR cluster that is spawned.
Further, I am not even 100% sure that the ulimit is causing this particular issue; it might be something else. I can only confirm once I change the ulimit and check.
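For reference, one common way to apply this kind of OS-level change to every node of every cluster automatically is an EMR bootstrap action, since bootstrap actions run on each node as it joins the cluster, including nodes added later by autoscaling. Below is a minimal boto3 sketch of wiring one into cluster creation; the S3 script path, instance types, and IAM role names are placeholders, and the script itself would do something like writing higher nofile/nproc entries under /etc/security/limits.d/. This is only a sketch under the assumption that the ulimit really is the culprit.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-job-with-raised-ulimit",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    # Bootstrap actions run on every node, including nodes added by the
    # autoscaling policy, so the ulimit change covers the whole fleet.
    BootstrapActions=[
        {
            "Name": "raise-ulimit",
            "ScriptBootstrapAction": {
                # Hypothetical S3 path: the script would append nofile/nproc
                # limits under /etc/security/limits.d/ before the Hadoop
                # daemons start.
                "Path": "s3://my-bucket/bootstrap/raise-ulimit.sh",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])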

Related

Is there data for AWS spot interruption rate over time?

We are running an EMR cluster with spot instances as task nodes. The EMR cluster is executing Spark jobs which sometimes run for several hours. Interruptions of spot instances can cause the failure of the Spark job, which then requires us to restart the job entirely.
I can see that there is some basic information on the "Frequency of interruption" in the AWS Spot Advisor. However, this data seems very generic: I can't see historical trends, and it does not give the probability of interruption as a function of how long the spot instance has been running (which should have a significant impact on that probability).
Is this data available somewhere? Or are there other data points that can be used as proxy?
I found this GitHub issue, which links to a JSON file in the Spot Advisor S3 bucket that includes interruption rates:
https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json
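For anyone who wants to consume that file programmatically, here is a small standard-library sketch. The field names ("spot_advisor", "ranges", "r", "s", "label") reflect the file's structure as observed and are not a documented, stable API, so treat them as assumptions and inspect the payload yourself.

import json
import urllib.request

URL = "https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# Inspect the top-level structure rather than relying on it staying stable.
print(list(data.keys()))

# As observed, "spot_advisor" is keyed region -> OS -> instance type, where
# "r" points at an entry in the "ranges" list of interruption-frequency bands
# and "s" is the typical savings percentage.
region, os_name, itype = "us-east-1", "Linux", "m5.xlarge"
entry = data.get("spot_advisor", {}).get(region, {}).get(os_name, {}).get(itype)
if entry is not None:
    band = next((r["label"] for r in data.get("ranges", [])
                 if r.get("index") == entry["r"]), "unknown")
    print(f"{itype} in {region}: interruption band {band}, savings ~{entry['s']}%")

Note that this is still a point-in-time snapshot per region/OS/instance type; it does not give historical trends or a probability conditioned on how long the instance has been running.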

Daemon service in ECS throwing no space left on device

Recently, after a production deployment, our primary service was not reaching steady state. On analysis, we found that the filebeat service running as a daemon service was unsteady, and its stopped tasks were throwing "no space left on the device". Also, the CPU and memory utilization of filebeat was higher than that of the primary service.
A large number of log files were being stored as part of the release. After reverting the change, the service came back to steady state.
Why did filebeat become unsteady? If memory was an issue, then why didn't the primary service also throw a "no space" error, since both filebeat and the primary service run on the same EC2 instance?
Check (assuming Linux):
df -h
Better still, install the AWS CloudWatch Agent on your EC2 instance to report additional metrics, such as disk space usage, into CloudWatch and help you get to the bottom of these things.
Sounds like your primary disk is full.
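If you want to confirm or track the disk usage programmatically rather than shelling in, here is a minimal sketch that mirrors df and optionally pushes the figure to CloudWatch; the namespace and dimension names are arbitrary choices for illustration, not something the CloudWatch Agent itself uses.

import shutil
import boto3

# Programmatic equivalent of `df -h` for one mount point.
total, used, free = shutil.disk_usage("/")
pct_used = used / total * 100
print(f"/ : {used / 1e9:.1f} GB used of {total / 1e9:.1f} GB ({pct_used:.1f}%)")

# Optionally publish the value so you can put a CloudWatch alarm on it.
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="Custom/ECSHost",  # illustrative namespace
    MetricData=[{
        "MetricName": "RootDiskUsedPercent",
        "Dimensions": [{"Name": "MountPoint", "Value": "/"}],
        "Value": pct_used,
        "Unit": "Percent",
    }],
)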

Airflow Scheduler - Ephemeral Storage - Evicted

I've been running into what should be a simple issue with my Airflow scheduler. Every couple of weeks, the scheduler becomes Evicted. When I run a describe on the pod, the reason given is: The node was low on resource: ephemeral-storage. Container scheduler was using 14386916Ki, which exceeds its request of 0.
The question is twofold. First, why is the scheduler using ephemeral storage at all? And second, is it possible to add ephemeral-storage when running on EKS?
Thanks!
I believe ephemeral storage is not an Airflow question but rather a matter of how your K8S cluster is configured.
Assuming we are talking about OpenShift's ephemeral storage:
https://docs.openshift.com/container-platform/4.9/storage/understanding-ephemeral-storage.html
This can be configured in your cluster, and it will make "/var/log" ephemeral.
I think the problem is that /var/log gets full, possibly with some of the system logs (not from Airflow but from other processes running in the same container). I think a solution would be a job that cleans those system logs periodically.
For example, we have this script that cleans up Airflow logs:
https://github.com/apache/airflow/blob/main/scripts/in_container/prod/clean-logs.sh
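As a rough illustration of the same idea in Python (e.g. run from a CronJob or a sidecar), here is a minimal cleanup sketch; the log directory and retention window are assumptions to adapt to whatever actually fills the scheduler's ephemeral storage in your deployment.

import os
import time
from pathlib import Path

LOG_DIR = Path("/opt/airflow/logs")  # assumed location of the offending logs
RETENTION_DAYS = 15                  # assumed retention window

cutoff = time.time() - RETENTION_DAYS * 86400

# Delete files older than the retention window.
for path in LOG_DIR.rglob("*"):
    try:
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
    except OSError:
        pass  # the file may vanish between listing and deletion

# Remove directories that the deletions above left empty.
for dirpath, _, _ in os.walk(LOG_DIR, topdown=False):
    if Path(dirpath) != LOG_DIR:
        try:
            os.rmdir(dirpath)  # only succeeds on empty directories
        except OSError:
            pass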

Why won't my AWS Batch jobs all start and run in parallel?

Hope someone can help me stop tearing my hair out!
I have a job with an array of ~700 indexes.
When I submit the job, I get no more than 20-30 running simultaneously.
They all run eventually, which leads me to assume it's a constraint elsewhere; and as all the jobs are the same, it's not permissions/roles/connectivity.
They are array/index jobs, with one job in the queue, and I can't find any limits on how many jobs of this type can run.
Note: I'm using EC2 (unmanaged) as the job was too big for Fargate.
I've tried:
double-checking they are parallel, not sequential
dropping individual CPU/memory for each job to 0.25 vCPU and 1 GB memory
creating "huge" compute environments with a max of 4096 vCPUs (no desired or minimum)
adding up to 3 compute environments to a queue (as per the limit)
What am I missing? Hope someone can point me in a different direction.
Thanks,
Ben
Based on the comments.
The issue was caused by EC2 service limits. AWS Batch uses EC2 to run the jobs, and it will not launch more resources than your EC2 service limits allow. You can request an increase to the service quotas for your Amazon EC2 resources to overcome the issue.
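For reference, the relevant quota can be inspected and raised through the Service Quotas API as well as the console. A boto3 sketch, where the quota code passed to the increase request is only a placeholder and should be taken from the listing for your account:

import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")

# List the EC2 quotas and find the one that is capping you; for Batch on EC2
# this is usually one of the "Running On-Demand ... instances" vCPU quotas.
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "Running On-Demand" in quota["QuotaName"]:
            print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Then request an increase (the value is in vCPUs for these quotas).
sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",  # placeholder; use the code printed above
    DesiredValue=512.0,
)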

AWS Batch permits only 32 concurrent jobs in array configuration

I'm running some AI experiments that requires multiple parallel runs in order to speed up the process.
I've built and pushed a container to ECR, and I'm trying to run it with AWS Batch with an array size of 35. But only 32 jobs start immediately, while the last three remain in the RUNNABLE state and don't start until one of the others has finished.
I'm running Fargate Spot for cost-saving reasons, with 1 vCPU and 8 GB RAM per job.
I looked at the documentation, but there are no service quota limits to increase regarding the array size (the max seems to be 10k) in Fargate, ECS, or AWS Batch.
What could be the cause?
My bad. The limit is actually imposed by the compute environment associated with the jobs.
I'm answering my own question hoping to help somebody in the future.
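For anyone hitting the same thing: on Fargate, each running array child counts its requested vCPUs against the compute environment's maxvCpus, so a maxvCpus of 32 with 1-vCPU jobs allows exactly 32 concurrent jobs. Below is a boto3 sketch of checking and raising that cap; the compute environment name is a placeholder.

import boto3

batch = boto3.client("batch")

CE_NAME = "my-fargate-spot-ce"  # placeholder compute environment name

# Inspect the compute environment attached to the job queue.
ce = batch.describe_compute_environments(
    computeEnvironments=[CE_NAME]
)["computeEnvironments"][0]
print("current maxvCpus:", ce["computeResources"]["maxvCpus"])

# Raise the cap so all 35 array children (1 vCPU each) can run at once.
batch.update_compute_environment(
    computeEnvironment=CE_NAME,
    computeResources={"maxvCpus": 64},
)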