Can I rerun failed mappers in EMR - mapreduce

I just woke up to a 16-hour EMR MapReduce job that failed because a 'few' mappers timed out.
Is there a way to rerun only those failed mappers (yes, it makes sense in my specific use case)? How?

Too late for a real-time answer, I'm afraid. In general: no.
But sometimes it's possible. If you can take the trouble to find out exactly which splits the failed mappers were processing (from the mapper logs), and if this was a map-only job, then you could create a custom job that goes only after the failed splits. Very hard in general, especially since splits typically don't line up with whole files.
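For illustration only, here is a rough sketch of what that could look like on a YARN-based EMR cluster with a streaming, map-only job, assuming the failed splits happen to cover whole input files (which, as noted above, is often not the case). The application ID, bucket paths, jar location, and mapper script are all hypothetical:

# From the mapper logs, find which splits the failed attempts were working on
yarn logs -applicationId application_1234567890123_0001 | grep "Processing split"

# Re-run a map-only job (zero reducers) over just those inputs, e.g. with Hadoop streaming
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -input s3://my-bucket/input/part-00017 \
  -input s3://my-bucket/input/part-00342 \
  -output s3://my-bucket/output/retry-failed-splits \
  -mapper my_mapper.py -file my_mapper.py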

Related

How to clear AWS Batch job history in dashboard

The AWS Batch Job queues dashboard shows the counts of failed and succeeded jobs for the last 24 hours. Is it possible to reset these counters to zero?
No, it's not possible to clear jobs. Batch keeps finished jobs around for at least a day (and in my experience occasionally up to a few weeks), and there's no API or console mechanism to accelerate the process.

Long-running Dataflow job fails with no errors in user code

After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know the probability that a single work item will fail 4 times over the course of a 24-hour job.

This same type of job failure happens frequently for this long-running job, though, so it seems like we need some way to either decrease the failure rate of work items or increase the allowed number of retries. Is either possible?

This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem, and it was solved by scaling up my workers' resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See this for a more extensive list of machine type options.
Edit: Set it to highcpu or highmem depending on the requirements of your pipeline process.
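For reference, a minimal sketch of how that flag can be passed when launching a Python pipeline from the command line (the script name, project, region, and bucket below are placeholders):

python my_pipeline.py \
  --runner DataflowRunner \
  --project my-project \
  --region us-central1 \
  --temp_location gs://my-bucket/tmp \
  --machine_type n1-highcpu-96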

Google Dataflow "Workflow failed" with no reason

I am running Dataflow jobs on Google Cloud Platform, and one new error I get is "Workflow failed" without any explanation.
The logs I get are the following:
2017-08-25 (00:06:01) Executing operation ReadNewXXXFromStorage/Read+JsonStringsToXXX+RemoveLanguagesFromXXX...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/GroupByKey/Create
2017-08-25 (00:06:01) Starting 1 workers in europe-west1-b...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/ParDo(SplitQuery)+ReadOldXYZ...
2017-08-25 (00:06:48) Workflow failed.
2017-08-25 (00:06:48) Stopping worker pool...
2017-08-25 (00:06:58) Worker pool stopped.
How am I supposed to find out what's going wrong? It should not be a problem with permissions on the object, as similar jobs run successfully.
When I try to rerun the template from Google Cloud Console, I get the message:
No metadata file found for this template
But I am able to start the template, and now it runs successfully. Might this have to do with exceeded quotas? We just increased our CPU and IP quotas for Dataflow, and I increased the number of parallel running jobs from 5 to 15 to make use of the quota. When I rerun the template without any other jobs running, everything seems to work fine.
Any input is highly appreciated. Thanks.
EDIT: It seems the jobs failed because of an exceeded CPU quota, but usually we would get an error description saying "could not spawn enough workers". Nevertheless, everything works fine after I reduced the maximum number of workers per job so that our quota cannot be exceeded.
I believe the "No metadata file found for this template" should be considered a warning, not an error. A template is able to have a "metadata" file associated with it which allows validation of parameters. If no such file is present, the parameters aren't validated, but everything else works as normal -- the message is just the indicator of this situation.
It sounds like the problem was the job being unable for other reasons. Based on your description and the edit, it sounds like this was because of lack of quota to run the job.
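As an illustration of the fix described in the question's edit, the worker cap can also be set when launching a template, so that a single job cannot consume the whole CPU quota (the job name, bucket path, and cap value are placeholders):

gcloud dataflow jobs run my-job \
  --gcs-location gs://my-bucket/templates/my-template \
  --max-workers 5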

Why is Google Dataproc HDFS Name Node in Safemode?

I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
What safe mode means
The NameNode stays in safe mode until the data nodes report which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is (actually) sufficient (but unreported) replication.
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking or otherwise), and therefore the cluster never left safe mode. The bad news is this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news is that these errors should be quite uncommon.
The reason is probably that you started the master node (housing the NameNode) before starting the workers. If you shut down all the nodes, start the workers first, and then start the master node, it should work. I suspect that the master node, starting first, checks whether the workers are there; if they are offline, it goes into safe mode. In general this should not happen because of the heartbeat mechanism, but it is what it is, and restarting the master node will resolve the matter. In my case it was with Spark on Dataproc.
HTH
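For reference, here is a quick sketch of the standard HDFS admin commands to check safe mode status and, if you are confident the data nodes are actually healthy, clear it manually (run on the master node):

hdfs dfsadmin -safemode get     # report whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode leave   # force the NameNode out of safe mode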

Concurrency in running Oozie workflow: how many and how to throttle

Let's say we have an Oozie workflow with a copy action node followed by a shell action node. Can I start multiple instances of such an Oozie workflow and run them in parallel? What if the concurrency spikes to the thousands, or even millions? Is that possible, and does Oozie even support that level of concurrency?
If not, then we will have to consider throttling and enforce a cap on how many Oozie workflow instances can run concurrently. We'd prefer to throttle this on the server/Oozie side (basically with out-of-the-box Oozie functionality), not on the client/caller side. For example, we have a huge launch script with lines like the ones below. We want to run it in a single shot and then let Oozie figure out how to throttle all these instances itself. We don't want to split it into multiple smaller chunks and kick off one chunk at a time.
oozie job -oozie http://myhost.com:11000/oozie -config job1.properties -run
oozie job -oozie http://myhost.com:11000/oozie -config job2.properties -run
......
oozie job -oozie http://myhost.com:11000/oozie -config job1000000.properties -run
You will not be able to achieve higher Oozie workflow concurrency than the number of map slots on your cluster, because a shell action is run as a one-mapper, zero-reducer MR job.
If you have many instances of a workflow to get through then the best mechanism is to use an Oozie coordinator. This will keep track of the completion of each instance and easily manage concurrency. An Oozie coordinator has a <concurrency> tag that controls how many instances of the workflow will execute in parallel, and a <throttle> tag that controls how many instances are brought into a waiting state before there is free concurrency for one to begin.
See: https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition
Note that the default behavior of an Oozie coordinator is to wait 5 minutes between each polling of whether a new instance should be created. If your workflows run in less than 5 minutes then the process will bottleneck on this interval. You can change this with the oozie.service.CoordMaterializeTriggerService.lookup.interval property (in seconds) in your oozie-site.xml file.
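To make that concrete, here is a rough sketch of a coordinator definition using those tags (the app name, dates, namespace version, and workflow path are placeholders; tune the numbers to your cluster's map-slot capacity):

<coordinator-app name="throttled-wf-launcher" frequency="${coord:minutes(5)}"
                 start="2017-01-01T00:00Z" end="2018-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <controls>
    <timeout>-1</timeout>            <!-- waiting instances never time out -->
    <concurrency>10</concurrency>    <!-- at most 10 workflow instances run in parallel -->
    <throttle>50</throttle>          <!-- at most 50 instances held in WAITING at once -->
  </controls>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>

And the polling interval mentioned above would go in oozie-site.xml, for example to check every 60 seconds instead of the default 300:

<property>
  <name>oozie.service.CoordMaterializeTriggerService.lookup.interval</name>
  <value>60</value>
</property>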