Resource manager does not transition from standby to active state - mapreduce

One Spark job had been running for more than 23 days and eventually caused the resource manager to crash. After restarting the resource manager instance (there are two of them in our cluster), both of them stayed in standby state.
We are getting this error:
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager Failed to load/recover state
org.apache.hadoop.yarn.exceptions.YarnException: Application with id application_1470300000724_40101 is already present! Cannot add a duplicate!
We could not kill 'application_1470300000724_40101' from YARN because the resource manager is not working. So we killed all the instances at the Unix level on all nodes, but that didn't work. We have also tried rebooting all nodes, with the same result.
Somewhere an entry for that job still exists and is preventing the resource manager from being elected active. We are using Cloudera 5.3.0, and I can see that this issue has been addressed and resolved in Cloudera 5.3.3, but at this moment we need a workaround to get past it.

To resolve this issue, we can format the RMStateStore by executing the command below:
yarn resourcemanager -format-state-store
But be careful: this will clear the history of all applications that were executed before this command is run.
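Alternatively, if the cluster uses the ZooKeeper-based state store (the usual choice for an HA setup like this), a less destructive workaround may be to delete only the offending application's znode and then restart both resource managers. A hedged sketch, assuming the default /rmstore parent path:
zookeeper-client
rmr /rmstore/ZKRMStateRoot/RMAppRoot/application_1470300000724_40101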

Related

Occasional failure on Amazon ECS with different error messages when starting task

We have a service running that orchestrates starting Fargate ECS tasks based on messages from a RabbitMQ queue. Sometimes, tasks inexplicably fail to start.
Info:
It starts a task somewhere between every other minute and every ten minutes.
It uses a fixed set of task definitions and re-uses them.
It consistently uses the same subnet in the same VPC.
The problem:
The vast majority of tasks, say 98%, start fine. Sometimes tasks fail to start and I get error messages. The error messages are not always the same, but they seem to be network-related.
Error messages I have received in the last 36 hours:
'Timeout waiting for network interface provisioning to complete.'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message'
'CannotPullContainerError: ref pull has been retried 5 time(s): failed to resolve reference <image that exists in repository>: failed to do request: Head https:<account-id>.dkr.ecr.eu-west-1.amazonaws.com/v2/k1-d...'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: context deadline exceeded'
Thoughts:
It looks to me like there is a network-connectivity error of some sort.
My Googling suggests that at least some of these errors can arise from a wrongly configured VPC or route tables.
I assume this is not the case here, since starting the exact same task with the exact same task definition in the same subnet works fine most of the time.
The ENI problem could perhaps arise from running out of ENIs on an EC2 instance, but since these tasks are started through Fargate, I feel that should not be the problem.
It seems like at least the network provisioning error can sometimes be an AWS issue.
Questions:
Why is this happening? Is it me or AWS?
Depending on the answer to the first question, is there something I can do to avoid this?
If there is nothing I can do to avoid it, is there something I can do to mitigate it while it's happening? Should I simply retry starting the task and hope that solves it?
Thanks very much in advance. I have been chasing this problem for months and feel like I am at least closing in on it, but this is as far as I can get on my own, I fear.
Tasks may fail to start for a number of reasons. Some are transient and lean more towards "AWS"; others are structural to your configuration and lean more towards "you". For example, the network timeout is often due to a network misconfiguration where the task ENI does not have a proper route to the registry (e.g. Docker Hub). In the other cases it may be a transient, one-off issue in the Fargate internals.
These problems may be transparent to you, or you may need to take action, depending on how you use Fargate. For example, if you run Fargate tasks as part of an ECS service or an EKS deployment, the ECS/EKS control loops will retry launching the task to meet the service/deployment target configuration.
If you are launching the Fargate task with a one-off RunTask API call (i.e. not as part of an orchestrator control loop that can monitor failures), then it depends on how you are calling that API. If you are calling it from tools such as AWS Step Functions, AWS Batch, and possibly others, they all have retry mechanisms, so if a task fails to launch they are smart enough to re-launch it.
However, if you are launching the task from an imperative line of code (or a CLI command, etc.), then it is on your code to verify that the task launched properly and to re-launch it upon an error message.
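As a hedged illustration using the AWS CLI (the cluster, task definition, subnet and security-group names are all placeholders, not values from the question), a launch wrapper could retry until the task actually reaches RUNNING:
#!/usr/bin/env bash
# Sketch only: my-cluster, my-task and the network IDs are hypothetical.
for attempt in 1 2 3; do
  task_arn=$(aws ecs run-task \
    --cluster my-cluster \
    --task-definition my-task \
    --launch-type FARGATE \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0]}' \
    --query 'tasks[0].taskArn' --output text)
  # The waiter exits non-zero if the task stops before reaching RUNNING,
  # which is where ENI-provisioning errors like the ones above surface.
  if [ "$task_arn" != "None" ] && aws ecs wait tasks-running --cluster my-cluster --tasks "$task_arn"; then
    echo "Task started: $task_arn"
    exit 0
  fi
  echo "Attempt $attempt failed; retrying..." >&2
  sleep $((attempt * 15))
done
exit 1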

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM, but some mornings I see that a few tasks failed during the night without a reason.
When I check the logs in the UI I see nothing, and I see no logs either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email on failure, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance, and it then runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e. it died before it could flush its logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
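You can also check from a terminal with cluster credentials; a sketch (the pod name and namespace are placeholders for whatever your cluster actually shows):
kubectl get pods --all-namespaces | grep Evicted
kubectl describe pod <airflow-worker-pod> --namespace <namespace>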
You will probably also see in "Task Instances" that the affected tasks have no start date, job id, or worker (hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a machine with more memory.
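A hedged example of lowering the concurrency (the environment name, location and value are placeholders):
gcloud composer environments update my-environment \
  --location us-central1 \
  --update-airflow-configs=celery-worker_concurrency=6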
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

[pacemaker] Pacemaker does not restart the nfs-server service

I have a clustered NFS server with Corosync and Pacemaker.
I installed the environment successfully, then found a problem while testing.
The screenshot was captured after adding the resources.
The nfs1 server is working well, and all resources are monitored by Pacemaker.
The problem occurs after stopping the NFS service.
If I run the command "systemctl stop nfs", the NFS service stops.
The cluster then fails over to nfs2 automatically (this is OK).
Then I run the command "pcs cluster standby bp-nfs2"; as a result, the cluster moves back to bp-nfs1, and all resources are started except nfsserver.
Even if I start the NFS service again, Pacemaker's nfs resource is still stopped.
I want Pacemaker to start the NFS service when Pacemaker starts.
This is the nfs resource create command:
pcs resource create nfsserver ocf:heartbeat:nfsserver \
nfs_shared_infodir="/mnt/sharedisk/" \
--group resource-group
If somebody knows about this issue, please teach me.
If you can't understand my English, I'm sorry. Thank you.
Solved it myself.
Reason.
If a resource is stopped by a failure, the resource has a failed action recorded against it.
You can see which resources have failed actions with the "pcs status" command.
Pacemaker never starts a resource that has a failed action; this is by design.
Solution.
Clear the failed action manually with the command "pcs resource cleanup [resource name]".
If you want it cleaned up automatically, run "pcs resource defaults failure-timeout=60s".
When a failure occurs on your resource, the active node fails over to another node.
Pacemaker then resumes monitoring the resources, and the failed action is cleaned up automatically after 60 seconds.
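For example, with the nfsserver resource from the question (the 60s value is illustrative; the second command sets the timeout on just this resource instead of cluster-wide):
pcs resource cleanup nfsserver
pcs resource update nfsserver meta failure-timeout=60s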

Kubernetes pods without any affinity suddenly stop scheduling because of MatchInterPodAffinity predicate

Without any known changes in our Kubernetes 1.6 cluster, all new or restarted pods are no longer being scheduled. The error I get is:
No nodes are available that match all of the following predicates:: MatchInterPodAffinity (10), PodToleratesNodeTaints (2).
Our cluster was working perfectly before, and I really cannot see any configuration changes that were made before this occurred.
Things I already tried:
restarting the master node
restarting kube-scheduler
deleting affected pods, deployments, stateful sets
Some of the pods do have anti-affinity settings that worked before, but most pods do not have any affinity settings.
Cluster Infos:
Kubernetes 1.6.2
Kops on AWS
1 master, 8 main-nodes, 1 tainted data processing node
Is there any known cause to this?
What are settings and logs I could check that could give more insight?
Is there any possibility to debug the scheduler?
The problem was that a Pod got stuck in deletion, which caused kube-controller-manager to stop working.
Deletion didn't work because the Pod/RS/Deployment in question had limits that conflicted with the maxLimitRequestRatio that we had set after its creation. A bug report is on the way.
The solution was to increase maxLimitRequestRatio and eventually restart kube-controller-manager.
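If you hit something similar, a quick way to look for a pod stuck in deletion (a sketch; the pod name and namespace are placeholders) is:
kubectl get pods --all-namespaces | grep Terminating
kubectl delete pod <pod-name> --namespace <namespace> --grace-period=0 --force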

Hello World pipeline with ShellCommandActivity

I'm trying to create a simple Data Pipeline with a single activity of the ShellCommandActivity type. I've attached the configuration of the activity and the EC2 resource.
When I execute this, the Ec2Resource sits in the WAITING_ON_DEPENDENCIES state and after some time changes to TIMEDOUT. The ShellCommandActivity is always in the CANCELED state. I see the instance launch and very quickly change to the terminated state.
I've specified an S3 log file URL, but that never gets updated.
Can anyone give me any pointers? Also is there any guidance out there on debugging this?
Thanks!!
You are currently forcing your instance to shut down after 1 minute, which gives the TIMEDOUT status if the activity can't execute in that time. Try increasing it to 50 minutes.
Also make sure you are using an AMI that runs Amazon Linux and that you are using full absolute paths in your scripts.
S3 log files are written as:
s3://bucket/folder/
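For reference, a hedged fragment of what the Ec2Resource definition might look like with a longer timeout (the id and duration are illustrative, not the asker's actual configuration; the pipeline's Default object carries the pipelineLogUri for the S3 log location):
{
  "id": "MyEc2Resource",
  "type": "Ec2Resource",
  "terminateAfter": "50 Minutes"
}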