How to restart one live node from a multi node cassandra cluster? - amazon-web-services

I have a production cassandra cluster of 6 nodes. I have made some changes to the cassandra.yaml file on one node and hence need to restart it.
How can I do it without losing any data or causing any cluster related issues?
Can I just kill the cassandra process on that particular node and start it again.
Cluster Info:
6 nodes. All active.
I am using AWS Ec2Snitch.
Thanks.

In case you are using replication factor greater than 1, and not using ALL consistency setting on your writes/reads, you can perform steps listed below, without any downtime/data loss. In case you have one of the limitations listed above, you'll need to increase your replication factor/change requests consistency before you continue.
Perform nodetool drain on that node (http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsDrain.html).
Stop the service.
Start the service.
In Cassandra, if durable writes are enabled, you should not lose data anyway - there's a mechanism of commitlog log replay in case of accidental restart, so you should not lose any data if doing just restart, but replaying commitlog can take some time.
The steps written above are a part of official upgrade procedure, and should be the "safest" option. You can do nodetool flush + restart, this will ensure that commitlog replay will be minimal and can be faster than drain approach.

Can I just kill the cassandra process on that particular node and start it again.
Essentially, yes. I'm assuming you have a RF of 3 with 6 nodes, so it shouldn't be a big deal. If you want, to do what I call a "clean shutdown" you could run the following commands first:
nodetool disablegossip
nodetool drain
And then (depending on your install):
sudo service cassandra stop
Or:
kill `cat cassandra.pid`
Note that if you do not complete these steps, you should still be ok. The drain just flushes the memtables to disk. If that doesn't happen, the commit log is reconciled against what's on disk at boot-time anyway. Those steps will just make your boot go faster.

Related

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

Could an HDFS read/write process be suspended/resumed?

I have one question regarding the HDFS read/write process:
Assuming that we have a client (for the sake of the example let's say that the client is a HADOOP map process) who requests to read a file from HDFS and or to write a file to HDFS, which is the process which actually does the read/write from/to the HDFS?
I know that there is a process for the Namenode and a process for each Datanode, what are their responsibilities to the system in general but I am confused in this scenario.
Is it the client's process by itself or is there another process in the HDFS, created and dedicated to the this specific client, in order to access and read/write from/to the HDFS?
Finally, if the second answer is true, is there any possibility that this process can be suspended for a while?
I have done some research and the most important solutions that I found were Oozie and JobControl class from hadoop API.
But, because I am not sure about the above workflow, I am not sure what process I am suspending and resuming with these tools.
Is it the client's process or a process which runs in HDFS in order to serve the request of the client?
Have a look at these SE posts to understand how HDFS writes work:
Hadoop 2.0 data write operation acknowledgement
Hadoop file write
Hadoop: HDFS File Writes & Reads
Apart from file/block writes, above question explain about datanode failure scenarios.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
One failure in datanode triggers corrective actions by framework.
Regarding your second query :
You have two types of schedulers :
FairScheduler
CapacityScheduler
Have a look at this article on suspend and resume
In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN.
When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs.
In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs.
So as far as I understand the process of a Datanode receives the data from the client's process (who requests to store some data in HDFS) and stores it. Then this Datanode forwards the exact same data to another Datanode (to achieve replication) and so on. When the replication will finish, an acknowledgement will go back to the Namenode who will finally inform the client about the completion of his write-request.
Based on the above flow, It is impossible to suspend an HDFS write operation in order to serve a second client's write-request (let's assume that the second client has higher priority) because if we suspend the Datanode by itself it will remain suspended for everyone who wants to write on it and as a result this part of the HDFS will be remained blocked. Finally, if I suspend a job from JobController class functions, I actually suspend the client's process (if I actually manage to catch it before his request will be done). Please correct me if I am wrong.

Why is Google Dataproc HDFS Name Node in Safemode?

I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
What safe mode means
The NameNode is in safemode until data nodes report on which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is (actually) sufficient (but unreported) replication.
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking, other) and, therefore, the cluster never left safe mode. The bad news is this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news, these errors should be quite uncommon.
The reason is that you probably started the master node (housing namenode) before starting the workers. If you shutdown all the nodes, start the workers first and then start the master node it should work. I suspect the master node starting first, checks the workers are there. If they are offline it goes into a safe mode. In general this should not happen because of the existence of heart beat. However, it is what it is and restart of master node will resolver the matter. In my case it was with spark on Dataproc.
HTH

Amazon Elasticache Failover

We have been using AWS Elasticache for about 6 months now without any issues. Every night we have a Java app that runs which will flush DB 0 of our redis cache and then repopulate it with updated data. However we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue and since the app was running fine for the past 6 months I am wondering if it is something related to a recent Elasticache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis requests are synchronous including the flush so it will block all other requests. In our case we are actually flushing 19m keys and it takes more then 30 seconds.
Elasticache performs a health check periodically and since the flush is running the health check will be blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush is only causing a failover 3-4 times a week. The best answer we can get is "We think its every 30 seconds". However our flush consistently takes more then 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check however they said this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an update version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
Hope this helps :)
Currently for Redis versions from 6.2 AWS ElastiCache has a new feature of thread monitoring. So the health check doesn't happen in the same thread as all other actions of Redis. Redis can continue to proceed a long command / lua script, but will still considered healthy. Because of this new feature failovers should happen less.

Recommendation for batch processing on aws

I'm new to using AWS, so any pointers would be appreciated.
I have a need to process large files using our in-house software.
It takes about 2GB of input and generates 5GB of output, running for 2 hours on a c3.8xlarge.
For now I do it manually, start an instance (either on-demand or spot-request), but now I want to reliably automate and scale this processing - what are good frameworks or platform or amazon services to do that?
Especially regarding the possibility that a spot-instance will be terminated half-way through (and I'll need to detect that and restart the job).
I heard about Python Celery, but does it work well with amazon and spot-instances?
Or are there other recommended mechanisms?
Thank you!
This is somewhat opinion-based, but you can mix and match some of the AWS pieces to make this easier:
put the input data on S3
push an entry into a SQS queue indicating a job needs to be processed with a long visibility timeout
set up an autoscaling policy based on SQS with your machine description in CloudFormation.
use UserData/cloudinit to set up the machine and start your application
write code to receive the queue entry, start processing, finish processing, then delete the SQS message.
code should check for another queued entry. If none, code should terminate machine.