MapReduce job restarts twice and fails

I am trying to submit a simple Hive query with an inner join. In the MapReduce logs I can see the mapper increase to 80%, then restart, increase again to 90%, then restart to 0%, and finally fail with the error below:
8368527 [main] ERROR org.apache.hadoop.hive.ql.Driver - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Please advise what the issue could be.

Related

"waitMsAvg" is always 0 for running BQ jobs?

Every time I check my BQ job status through the API, it returns 0 for waitMsAvg while the job is running.
After the job finishes, I can see valid numbers for waitMsAvg.
Is it intended behavior, or is it a bug?
How can I see the values when the job is running?
The BigQuery query plan provides a timing classification for each query stage, where waitMsAvg is the average time workers spent waiting to be scheduled.
You are getting waitMsAvg as 0 because your query is still running. Once the query job completes, waitMsAvg will show the average time workers spent waiting to be scheduled.
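For reference, a minimal sketch of reading the per-stage waitMsAvg from the query plan, assuming the google-cloud-bigquery Python client and default credentials (the sample query is just a placeholder); while the job is still running the values will typically be 0 or unset, as described above:

    import time
    from google.cloud import bigquery

    client = bigquery.Client()  # default project and credentials assumed
    job = client.query("SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`")

    while not job.done():
        for stage in job.query_plan or []:
            # still running: wait_ms_avg is usually 0 or None here
            print(stage.name, "waitMsAvg =", stage.wait_ms_avg)
        time.sleep(5)
        job.reload()  # refresh job statistics from the API

    for stage in job.query_plan:
        # after completion the timing fields are populated
        print(stage.name, "waitMsAvg =", stage.wait_ms_avg)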

Airflow - Subdag stuck in running state with its internal tasks ended successfully

We're facing very strange behavior in Airflow.
Some subdags get stuck in the running state even though their internal tasks have finished successfully.
For example, I have the subdag load-folder-to-layer that should have started and ended on 2020-11-18, but it stayed stuck until the 20th.
I looked in the task_instances and job tables and could see that the task's job was executing and receiving heartbeats:
The last log message in the subdag is:
[2020-11-18 09:22:01,879] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: bietlejuice.docx.load-folder-reference-to-clean 2020-11-17T03:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
I also noticed a node drop on that exact same day, the 18th (image):
This leads me to think that the scheduler hits a bug when reassigning this task (subdag) to another worker, leaving the task stuck.
Does anyone have a clue about this?
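A rough sketch of the kind of metadata-database check described above, assuming Airflow 2.x and access to the metadata DB through Airflow's own ORM models:

    from airflow.models import TaskInstance
    from airflow.utils.session import create_session
    from airflow.utils.state import State

    with create_session() as session:
        # list task instances that are still marked as running
        stuck = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == State.RUNNING)
            .all()
        )
        for ti in stuck:
            print(ti.dag_id, ti.task_id, ti.execution_date, ti.state)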

Long-running Dataflow job fails with no errors in user code

After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know the probability of a single work item failing 4 times over the course of a 24-hour job. But this same type of job failure happens frequently for this long-running job, so it seems we need a way to either decrease the failure rate of work items or increase the allowed number of retries. Is either possible?
This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem, and it was solved by scaling up my worker resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configuration. See the GCP machine types documentation for a more extensive list of options.
Edit: set it to highcpu or highmem depending on the requirements of your pipeline.
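For the Python SDK the question mentions, a minimal sketch of passing the machine type through the pipeline options (project, region, and bucket names below are placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                # placeholder
        region="us-central1",                # placeholder
        temp_location="gs://my-bucket/tmp",  # placeholder
        machine_type="n1-highcpu-96",        # larger workers, as suggested above
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create([1, 2, 3])
         | "Print" >> beam.Map(print))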

BigQuery unable to insert job. Workflow failed

I need to run a batch job from GCS to BigQuery via Dataflow and Beam. All my files are Avro with the same schema.
I've created a Dataflow Java application that is successful on a smaller set of data (~1 GB, about 5 files).
But when I try to run it on a bigger set of data (>500 GB, >1000 files), I receive this error message:
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix 1b83679a4f5d48c5b45ff20b2b822728_6e48345728d4da6cb51353f0dc550c1b_00001_00000, reached max retries: 3, last failed load job: ...
After 3 retries it terminates with:
Workflow failed. Causes: S57....... A work item was attempted 4 times without success....
This step is the load to BigQuery.
Stackdriver says the processing is stuck in step ....for 10m00s... and
Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes.....
I looked up the 409 error code stating that I might have an existing job, dataset, or table. I've removed all the tables and re-ran the application but it still shows the same error message.
I am currently limited to 65 workers, and I have them using n1-standard-4 machines.
I believe there are other ways to move the data from GCS to BigQuery, but I need to demonstrate Dataflow.
"java.lang.RuntimeException: Failed to create job with prefix beam_load_csvtobigqueryxxxxxxxxxxxxxx, reached max retries: 3, last failed job: null.
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:198)..... "
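For reference, a minimal Python-SDK sketch of the same pipeline shape (the question uses the Java SDK; bucket, table, and option values below are placeholders, and the target table is assumed to already exist):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                # placeholder
        region="us-central1",                # placeholder
        temp_location="gs://my-bucket/tmp",  # placeholder
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadAvro" >> beam.io.ReadFromAvro("gs://my-bucket/input/*.avro")
         | "LoadToBQ" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.my_table",
               method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # batch load jobs, as in the question
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))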
One possible cause could be a privilege issue. Ensure the account that interacts with BigQuery has the "bigquery.jobs.create" permission, which is included in the predefined role "BigQuery User".
Posting the comment of @DeaconDesperado as community wiki: they experienced the same error, and removing from the table name any characters outside the allowed set (Unicode letters, marks, numbers, connectors, dashes, or spaces) made the error go away.
I got the same problem using "roles/bigquery.jobUser", "roles/bigquery.dataViewer", and "roles/bigquery.user". But only when granting "roles/bigquery.admin" did the issue get resolved.
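A quick way to test the permission mentioned in the answers above, assuming the google-cloud-bigquery client and the same credentials the pipeline uses: try to create a trivial query job and see whether it is rejected.

    from google.cloud import bigquery
    from google.api_core.exceptions import Forbidden

    client = bigquery.Client()  # picks up the credentials under test
    try:
        client.query("SELECT 1").result()  # creating any job requires bigquery.jobs.create
        print("job creation permission: OK")
    except Forbidden as exc:
        print("missing job-creation permission:", exc)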

Google Dataflow "Workflow failed" with no reason

I am running Dataflow jobs on Google Cloud Platform, and a new error I get is "Workflow failed" without any explanation.
The logs I get are the following:
2017-08-25 (00:06:01) Executing operation ReadNewXXXFromStorage/Read+JsonStringsToXXX+RemoveLanguagesFromXXX...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/GroupByKey/Create
2017-08-25 (00:06:01) Starting 1 workers in europe-west1-b...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/ParDo(SplitQuery)+ReadOldXYZ...
2017-08-25 (00:06:48) Workflow failed.
2017-08-25 (00:06:48) Stopping worker pool...
2017-08-25 (00:06:58) Worker pool stopped.
How am I supposed to find out what's going wrong? It should not be a problem with permissions on the object, as similar jobs run successfully.
When I try to rerun the template from Google Cloud Console, I get the message:
No metadata file found for this template
But I am able to start the template, and now it runs successfully. Could this have to do with exceeded quotas? We just increased our CPU and IP quota for Dataflow, and I increased our parallel running jobs from 5 to 15 to make use of the quota. When I rerun the template without any other jobs running, everything seems to work fine.
Any input is highly appreciated. Thanks.
EDIT: It seems the jobs failed because of an exceeded CPU quota, but usually we would get an error description saying "could not spawn enough workers". Nevertheless, everything works fine after I reduced the maximum number of workers per job so that our quota cannot be exceeded.
I believe the "No metadata file found for this template" should be considered a warning, not an error. A template is able to have a "metadata" file associated with it which allows validation of parameters. If no such file is present, the parameters aren't validated, but everything else works as normal -- the message is just the indicator of this situation.
It sounds like the job was unable to run for other reasons. Based on your description and the edit, it sounds like this was because of a lack of quota to run the job.
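A rough back-of-the-envelope check of the quota reasoning above (all numbers are placeholders except the 15 parallel jobs mentioned in the question): the total CPUs used by concurrent jobs must stay within the regional quota, which caps the maximum workers per job.

    # placeholders / assumptions, not values confirmed by the question
    cpu_quota = 600          # regional CPU quota available to Dataflow
    cores_per_worker = 4     # e.g. n1-standard-4 workers
    concurrent_jobs = 15     # parallel jobs mentioned above

    max_workers_per_job = cpu_quota // (cores_per_worker * concurrent_jobs)
    print(max_workers_per_job)  # 10 -> total CPUs = 15 * 10 * 4 = 600 <= quota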