I've been runnning into what should be a simple issue with my airflow scheduler. Every couple of weeks, the scheduler becomes Evicted. When I run a describe on the pod, the issue is because The node was low on resource: ephemeral-storage. Container scheduler was using 14386916Ki, which exceeds its request of 0.
The question is two fold. First, why is the scheduler utilizing ephemeral-storage? And second, is it possible to do add ephemeral-storage when running on eks?
Thanks!
I believe Ephemeral Storage is not Airflow's question but more of the configuration of your K8S cluster.
Assuming we are talking about OpenShift' ephemeral storage:
https://docs.openshift.com/container-platform/4.9/storage/understanding-ephemeral-storage.html
This can be configured in your cluster and it wil make "/var/log" ephemeral.
I think the problem is that /var/logs gets full. Possibly some of the system logs (not from airlfow but from some other processes running in the same container). I think a solution will be to have a job that cleans that system log periodically.
For example we have this script that cleans-up Airlfow logs:
https://github.com/apache/airflow/blob/main/scripts/in_container/prod/clean-logs.sh
Related
I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering them from the target group and draining all http connections to the old tasks before eventually stopping them. However, lately ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This is resulting in 8-12 seconds of API down time for us while new http requests continue to be routed to the now-stopped tasks that are still in the target group. This happens now whether we trigger the service update via the CLI or the console - same behaviour. Shown here are a screenshot showing a sample sequence of Events from ECS demonstrating the issue as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments, but ultimately this turned out to be because of the high startup overhead on the new versions of the tasks spinning up briefly pegging the CPU of the instances in the cluster to 100% when there was also sufficient taffic happening during the deployment, thus causing some requests to get dropped.
A good-enough for now solution was to adjust our minimum healthy deployment percentage up to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces the deployments to "slow down", only launching 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads out the high task startup overhead to two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
Recently after a production deployment, our primary service was not reaching steady state. On analysis we found out that the filebeat service running as a daemon service was unsteady. The stopped tasks were throwing "no space left on the device". Also, the CPU and memory utilization for the filebeat was higher than the primary service.
A large amount of log files were being stored as part of the release. After reverting the change, the service came back to steady state.
Why did filebeat become unsteady? If memory was an issue, then why didn't the service also throw "no space" issue as both filebeat and primary service runs on the same EC2 instance?
Check (assuming Linux)
df -h
Better still, install AWS CloudWatch Agent on your EC2 instance to get additional metrics such as disk space usage reported into CloudWatch to help you get to the bottom of these things.
Sounds like your primary disk is full.
I have a Spark application in Java, running on AWS EMR. I have implemented an AutoScaling policy based on the available yarn memory. For jobs which require higher memory, EMR brings up cluster up to 1+8 nodes.
After a point of time in my job I keep getting the below error, this error goes on for hours before I terminate cluster manually.
java.io.IOException: All datanodes [DatanodeInfoWithStorage[<i.p>:50010,DS-4e7690c7-5946-49c5-b203-b5166c2ff58d,DISK]] are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1531)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
This error is in the very first worker node that was spawned. After some digging, I found out this might be because of ulimit. Now increasing ulimit can be done easily on any Linux or EC2 machines manually. But I am unable to get how to do this dynamically every EMR cluster that is spawned.
Further, I am not even 100% sure if ulimit is causing this particular issue. This might be something else as well. I can confirm only once I change ulimit and check.
One of my ECS fargate tasks is stopping and restarting in what seems to be a somewhat random fashion. I started the task in Dec 2019 and it has stopped/restarted three times since then. I've found that the task stopped and restarted from its 'Events' log (image below) but there's no info provided as to why it stopped..
So what I've tried to do to date to debug this is
Checked the 'Stopped' tasks inside the cluster for info as to why it might have stopped. No luck here as it appears 'Stopped' tasks are only held there for a short period of time.
Checked CloudWatch logs for any log messages that could be pertinent to this issue, nothing found
Checked CloudTrail event logs for any event pertinent to this issue, nothing found
Confirmed the memory and CPU utilisation is sufficient for the task, in fact the task never reaches 30% of it's limits
Read multiple AWS threads about similar issues where solutions mainly seem to be connected to using an ELB which I'm not..
Any have any further debugging device or ideas what might be going on here?
I ran into the same issue and found this from aws
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-maintenance.html
When AWS determines that a security or infrastructure update is needed
for an Amazon ECS task hosted on AWS Fargate, the tasks need to be
stopped and new tasks launched to replace them.
Also a github post on storing stopped tasks info in cloudwatch logs:
https://github.com/aws/amazon-ecs-agent/issues/368
We're considering migrating our data pipelines to Airflow and one item we require is the ability for a task to create, execute on, and destroy an EC2 instance. I know that Airflow supports ECS and Fargate which will have a similar effect, but not all of our tasks will fit directly into that paradigm without significant refactoring.
I see that we can use a distributed executor and scale the pool of workers up and down manually, but we really don't need to have workers up all the time, only occasionally, and when we do we're just as well served by having a dedicated machine for each task as it runs, destroying each machine as the task completes.
The idea I have stuck in my head would be something like a "EphemeralEC2Operator", which would stand up a machine, SSH in, run a bash script which orchestrates the task, and then tear the machine down.
Does this capability exist, or would we have to implement it ourselves?
Thanks in advance.