Why is Google Dataproc HDFS Name Node in Safemode? - hdfs

I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.

What safe mode means
The NameNode is in safemode until data nodes report on which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is (actually) sufficient (but unreported) replication.
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking, other) and, therefore, the cluster never left safe mode. The bad news is this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news, these errors should be quite uncommon.

The reason is that you probably started the master node (housing namenode) before starting the workers. If you shutdown all the nodes, start the workers first and then start the master node it should work. I suspect the master node starting first, checks the workers are there. If they are offline it goes into a safe mode. In general this should not happen because of the existence of heart beat. However, it is what it is and restart of master node will resolver the matter. In my case it was with spark on Dataproc.
HTH

Related

How to stop a compute node with SLURM?

I am using SLURM on AWS to manage jobs as part of AWS parallelcluster. I have two questions :
When using scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I achieve that ?
When starting, I made the mistake of not making my script executable so the sbatch *script.sh* worked but the compute node was doing nothing. How could I identify such behaviour and handle it properly ? Is the proper to e.g. stop the idle node after some time for example and output that in a log ? How can I achieve that ?
Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html
Bottom line is that instances that have no jobs for a period of time longer than the scaledown_idletime (the default setting is 10 minutes) will get scaled down (terminated) by the cluster, automagically.
You can tweak the setting in the config file when you build your cluster, if 10 mins is too long. Just think about your workload first, because you don't want small delays between jobs to cause you a lot of churn whilst you wait for nodes to die and then get created again shortly after, hence the 10 minute thing.

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

Kubernetes pods without any affinity suddenly stop scheduling because of MatchInterPodAffinity predicate

Without any knows changes in our Kubernetes 1.6 cluster all new or restarted pods are not scheduled anymore. The error I get is:
No nodes are available that match all of the following predicates:: MatchInterPodAffinity (10), PodToleratesNodeTaints (2).
Our cluster was working perfectly before and I really cannot see any configuration changes that have been made before that occured.
Things I already tried:
restarting the master node
restarting kube-scheduler
deleting affected pods, deployments, stateful sets
Some of the pods do have anti-affinity settings that worked before, but most pods do not have any affinity settings.
Cluster Infos:
Kubernetes 1.6.2
Kops on AWS
1 master, 8 main-nodes, 1 tainted data processing node
Is there any known cause to this?
What are settings and logs I could check that could give more insight?
Is there any possibility to debug the scheduler?
The problem was that a Pod got stuck in deletion. That caused kube-controller-manager to stop working.
Deletion didn't work because the Pod/RS/Deployment in question had limits that conflicted with the maxLimitRequestRatio that we had set after the creation. A bug report is on the way.
The solution was to increase maxLimitRequestRatio and eventually restart kube-controller-manager.

How to restart one live node from a multi node cassandra cluster?

I have a production cassandra cluster of 6 nodes. I have made some changes to the cassandra.yaml file on one node and hence need to restart it.
How can I do it without losing any data or causing any cluster related issues?
Can I just kill the cassandra process on that particular node and start it again.
Cluster Info:
6 nodes. All active.
I am using AWS Ec2Snitch.
Thanks.
In case you are using replication factor greater than 1, and not using ALL consistency setting on your writes/reads, you can perform steps listed below, without any downtime/data loss. In case you have one of the limitations listed above, you'll need to increase your replication factor/change requests consistency before you continue.
Perform nodetool drain on that node (http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsDrain.html).
Stop the service.
Start the service.
In Cassandra, if durable writes are enabled, you should not lose data anyway - there's a mechanism of commitlog log replay in case of accidental restart, so you should not lose any data if doing just restart, but replaying commitlog can take some time.
The steps written above are a part of official upgrade procedure, and should be the "safest" option. You can do nodetool flush + restart, this will ensure that commitlog replay will be minimal and can be faster than drain approach.
Can I just kill the cassandra process on that particular node and start it again.
Essentially, yes. I'm assuming you have a RF of 3 with 6 nodes, so it shouldn't be a big deal. If you want, to do what I call a "clean shutdown" you could run the following commands first:
nodetool disablegossip
nodetool drain
And then (depending on your install):
sudo service cassandra stop
Or:
kill `cat cassandra.pid`
Note that if you do not complete these steps, you should still be ok. The drain just flushes the memtables to disk. If that doesn't happen, the commit log is reconciled against what's on disk at boot-time anyway. Those steps will just make your boot go faster.

Amazon Elasticache Failover

We have been using AWS Elasticache for about 6 months now without any issues. Every night we have a Java app that runs which will flush DB 0 of our redis cache and then repopulate it with updated data. However we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue and since the app was running fine for the past 6 months I am wondering if it is something related to a recent Elasticache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis requests are synchronous including the flush so it will block all other requests. In our case we are actually flushing 19m keys and it takes more then 30 seconds.
Elasticache performs a health check periodically and since the flush is running the health check will be blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush is only causing a failover 3-4 times a week. The best answer we can get is "We think its every 30 seconds". However our flush consistently takes more then 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check however they said this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an update version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
Hope this helps :)
Currently for Redis versions from 6.2 AWS ElastiCache has a new feature of thread monitoring. So the health check doesn't happen in the same thread as all other actions of Redis. Redis can continue to proceed a long command / lua script, but will still considered healthy. Because of this new feature failovers should happen less.