Troubleshooting worker usage with Ray - ray

New to Ray, trying to do some troubleshooting on distribution of tasks across the cluster.
Right now we have the head node doing all of the work, and not delegating anything at all to workers. Hoping to get some suggestions on logs to investigate to track down potential errors. How / where can I see the tasks that a worker node is processing?

Related

How to stop a compute node with SLURM?

I am using SLURM on AWS to manage jobs as part of AWS parallelcluster. I have two questions :
When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that ?
When starting, I made the mistake of not making my script executable so the sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly ? Is the proper to e.g. stop the idle node after some time for example and output that in a log ? How can I achieve that ?
Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html
Bottom line is that instances that have no jobs for a period of time longer than the scaledown_idletime (the default setting is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
You can tweak the setting in the config file when you build your cluster, if 10 mins is too long. Just think about your workload first, because you don't want small delays between jobs to cause you a lot of churn whilst you wait for nodes to die and then get created again shortly after, hence the 10 minute thing.

Scheduled Task Produces Too Many Tasks

In one of my ECS clusters I have a scheduled Fargate task that's meant to spin up 8 instances of it's given target. However, when the task procs it starts up waaayyyy more than 8 tasks. Sometimes as many as 50. Does anyone know what could be causing this to happen?
Details:
Cron Expression: cron(40 16 ? * 1-5 *)
Target Definition:
For anyone who might run into this problem in the future:
This problem occurred because we had too many tasks running the cluster. As of the writing of this answer AWS set of limit of 50 tasks running in a single cluster. Before the rule triggered there was already close to 50 tasks running. The rule would proc and would start spinning up new tasks trying to get to the desired number (8).
However, due to the limit it would never be able to get 8 because new tasks over the limit would just get shutdown. So it would keep trying, and keep trying, and keep trying to spin up tasks which led to there being a huge pending queue of tasks that would seemingly push (nearly) all of our tasks out of the cluster and we'd be left with way more tasks than we had asked for.
The solution: we just moved the scheduled task into a new cluster to avoid the 50 task limit.

Celery/SQS task retry gone haywire - how to get rid of it?

We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a celery task that has a self.retry() in it. Max_retries is set to 15. The retry is happening with an exponential backoff and takes 182 hours to complete all 15 retries.
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!

Kubernetes pods without any affinity suddenly stop scheduling because of MatchInterPodAffinity predicate

Without any knows changes in our Kubernetes 1.6 cluster all new or restarted pods are not scheduled anymore. The error I get is:
No nodes are available that match all of the following predicates:: MatchInterPodAffinity (10), PodToleratesNodeTaints (2).
Our cluster was working perfectly before and I really cannot see any configuration changes that have been made before that occured.
Things I already tried:
restarting the master node
restarting kube-scheduler
deleting affected pods, deployments, stateful sets
Some of the pods do have anti-affinity settings that worked before, but most pods do not have any affinity settings.
Cluster Infos:
Kubernetes 1.6.2
Kops on AWS
1 master, 8 main-nodes, 1 tainted data processing node
Is there any known cause to this?
What are settings and logs I could check that could give more insight?
Is there any possibility to debug the scheduler?
The problem was that a Pod got stuck in deletion. That caused kube-controller-manager to stop working.
Deletion didn't work because the Pod/RS/Deployment in question had limits that conflicted with the maxLimitRequestRatio that we had set after the creation. A bug report is on the way.
The solution was to increase maxLimitRequestRatio and eventually restart kube-controller-manager.

Why is Google Dataproc HDFS Name Node in Safemode?

I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
What safe mode means
The NameNode is in safemode until data nodes report on which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is (actually) sufficient (but unreported) replication.
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking, other) and, therefore, the cluster never left safe mode. The bad news is this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news, these errors should be quite uncommon.
The reason is that you probably started the master node (housing namenode) before starting the workers. If you shutdown all the nodes, start the workers first and then start the master node it should work. I suspect the master node starting first, checks the workers are there. If they are offline it goes into a safe mode. In general this should not happen because of the existence of heart beat. However, it is what it is and restart of master node will resolver the matter. In my case it was with spark on Dataproc.
HTH