pending - Waiting for next available executor in Hudson - build

In Hudson, I am getting "pending - Waiting for next available executor" while building my job. The Build Executor Status shows "Dead(!)".
Can anyone tell me how to overcome this issue?
How do I bring a dead Hudson executor back to life?

Related

Data Fusion pipelines fail without executing

I have more than 50 Data Fusion pipelines running concurrently in an Enterprise instance of Data Fusion.
About 4 of them fail at random on each concurrent run, with the logs showing only the provisioning followed by the deprovisioning of the Dataproc cluster, as in this log:
2021-04-29 12:52:49,936 - INFO [provisioning-service-4:i.c.c.r.s.p.d.DataprocProvisioner#203] - Creating Dataproc cluster cdap-fm-smartd-cc94285f-a8e9-11eb-9891-6ea1fb306892 in project project-test, in region europe-west2, with image 1.3, with system labels {goog-datafusion-version=6_1, cdap-version=6_1_4-1598048594947, goog-datafusion-edition=enterprise}
2021-04-29 12:56:08,527 - DEBUG [provisioning-service-1:i.c.c.i.p.t.ProvisioningTask#116] - Completed PROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
2021-04-29 13:04:01,678 - DEBUG [provisioning-service-7:i.c.c.i.p.t.ProvisioningTask#116] - Completed DEPROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
When a failed pipeline is restarted, it completes successfully.
All the pipelines are started and monitored via Composer, using an asynchronous start and a custom wait SensorOperator (see the sketch at the end of this question).
There are no quota-exceeded warnings.
Additional info:
Data Fusion 6.1.4
with an ephemeral Dataproc cluster with 1 master and 2 workers, image version 1.3.89
EDIT
The appfabric logs related to each failed pipeline are:
WARN [program.status:i.c.c.i.a.r.d.DistributedProgramRuntimeService#172] - Twill RunId does not exist for the program program:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow, runId f34a6fb4-acb2-11eb-bbb2-26edc49aada0
WARN [pool-11-thread-1:i.c.c.i.a.s.RunRecordCorrectorService#141] - Fixed RunRecord for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.fdc22f56-acb2-11eb-bbcf-26edc49aada0 in STARTING state because it is actually not running
Further research loosely connected the problem to an inconsistent state in the CDAP run records when many concurrent requests are made via the REST API.
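
For illustration, here is a minimal sketch of the "asynchronous start + custom wait SensorOperator" pattern mentioned above, implemented as a sensor that polls the CDAP run-record REST endpoint for the DataPipelineWorkflow run. The class name, instance URL, bearer token, "default" namespace, and terminal states are assumptions for the sketch, not the actual operator used in this setup:

import requests
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class CdapRunStateSensor(BaseSensorOperator):
    """Waits until a Data Fusion (CDAP) pipeline run reaches a terminal state."""
    template_fields = ("run_id",)

    @apply_defaults
    def __init__(self, instance_url, pipeline_name, run_id, auth_token, **kwargs):
        super(CdapRunStateSensor, self).__init__(**kwargs)
        self.instance_url = instance_url    # CDAP API endpoint of the Data Fusion instance
        self.pipeline_name = pipeline_name
        self.run_id = run_id                # pushed to XCom by the asynchronous start task
        self.auth_token = auth_token        # OAuth2 bearer token for the Enterprise instance

    def poke(self, context):
        # GET /v3/namespaces/<ns>/apps/<app>/workflows/DataPipelineWorkflow/runs/<run-id>
        url = ("%s/v3/namespaces/default/apps/%s/workflows/DataPipelineWorkflow/runs/%s"
               % (self.instance_url, self.pipeline_name, self.run_id))
        resp = requests.get(url, headers={"Authorization": "Bearer %s" % self.auth_token})
        resp.raise_for_status()
        status = resp.json().get("status")
        self.log.info("Run %s of %s is %s", self.run_id, self.pipeline_name, status)
        if status in ("FAILED", "KILLED"):
            raise ValueError("Pipeline run ended in state %s" % status)
        return status == "COMPLETED"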

Airflow - Subdag stuck in running state although its internal tasks ended successfully

We're facing very strange behavior in Airflow.
Some subdags get stuck in the running state even though their internal tasks have ended successfully.
For example, the subdag load-folder-to-layer should have started and ended on 2020-11-18, but it stayed stuck until the 20th.
I looked at the task_instance and job tables and could see that the task's job was executing and receiving heartbeats.
The last log message in the subdag is:
[2020-11-18 09:22:01,879] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: bietlejuice.docx.load-folder-reference-to-clean 2020-11-17T03:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
Also, I noticed a node drop on that exact same day, the 18th (image).
This leads me to think that the scheduler hits a bug when reassigning this task (subdag) to another worker, causing the task to get stuck.
Does anyone have a clue about this?

How to relaunch a Spark executor after it crashes (in YARN client mode)?

Is it possible to relaunch a Spark executor after it crashes? I understand that the failed tasks are re-run in the existing working Spark executors, but I hope there is a way to relaunch the crashed Spark executor.
I am running pyspark 1.6 on YARN, in client mode.
No, it is not possible to relaunch it yourself. Spark takes care of it: when an executor dies, Spark requests a new one the next time it asks for "resource containers" for executors.
If the executor was close to the data it was processing, Spark will request a new executor according to the locality preferences of the task(s), and chances are that the host where the executor died will be used again to run the new one.
An executor is a JVM process that spawns threads for tasks and, honestly, does not do much else. If you're concerned about the data blocks it holds, you should consider using Spark's external shuffle service.
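
For example, here is a minimal PySpark 1.6 sketch of that configuration for YARN client mode: it enables the external shuffle service (plus dynamic allocation, which depends on it). The application name is a placeholder, and it assumes the Spark shuffle service has already been registered as a YARN NodeManager auxiliary service on the cluster:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("executor-recovery-demo")
        # Serve shuffle blocks from outside the executor JVM, so losing an
        # executor does not lose its shuffle output.
        .set("spark.shuffle.service.enabled", "true")
        # Optional: let Spark grow/shrink the executor pool; requires the shuffle service.
        .set("spark.dynamicAllocation.enabled", "true"))

sc = SparkContext(conf=conf)
print(sc.parallelize(list(range(100))).sum())
sc.stop()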
Consider reading the document Job Scheduling in the official documentation.

Impact of AWS SWF connectivity on currently executing workflows?

I am curious to understand the impact of a loss of connectivity with AWS SWF on currently executing workflows. Could someone please help me understand it?
I understand there would be timeouts of deciders and workers, but I am not sure of the exact behavior.
An activity worker that is waiting on a poll will get an error and is expected to keep retrying until connectivity is back. An activity worker that has completed a task is expected to keep retrying to report its completion until the task expires.
A workflow worker that is waiting on a poll will get an error and is expected to keep retrying until connectivity is back. A workflow worker that has completed a decision task can retry completing it until it expires. After it expires, the decision task is automatically rescheduled and is available for polling as soon as connectivity is back.
A scheduled activity that wasn't picked up within its specified schedule-to-start timeout is automatically failed. Its failure is recorded in the workflow history and a new decision is scheduled.
A picked-up activity that wasn't completed within its specified start-to-close timeout is automatically failed. Its failure is recorded in the workflow history and a new decision is scheduled.
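
To make the retry behavior concrete, here is a rough sketch of an activity-worker poll loop using boto3. The domain, task list, region, worker identity, 60-second back-off, and the do_work helper are placeholder assumptions:

import time
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

swf = boto3.client("swf", region_name="us-east-1")

def do_work(task_input):
    return task_input  # trivial placeholder for the real activity logic

def run_activity_worker():
    while True:
        try:
            # Long poll for the next activity task; returns an empty taskToken
            # if nothing was scheduled within the poll window.
            task = swf.poll_for_activity_task(
                domain="my-domain",
                taskList={"name": "my-task-list"},
                identity="worker-1",
            )
            token = task.get("taskToken")
            if not token:
                continue  # long-poll timeout, poll again
            result = do_work(task.get("input", ""))
            # Report completion; SWF rejects this once the task's
            # start-to-close timeout has already expired.
            swf.respond_activity_task_completed(taskToken=token, result=result)
        except EndpointConnectionError:
            # Connectivity lost: keep retrying until it is back.
            time.sleep(60)
        except ClientError as err:
            # e.g. the task already timed out on the SWF side.
            print("SWF call failed: %s" % err)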

Workflow handling on Camunda engine restart

Scenario: a few jobs are currently running. If a cluster reboot happens in the middle of job execution, I should be able to observe the continuity of process instance execution, with the proper state, after the reboot.
Will Camunda take care of preserving the process instance state via checkpoints and resume automatically from where it halted?
If you have reached at least one asynchronous continuation (e.g. check the "async before" or "async after" property, for instance on the start event), then the process instance has been persisted to the database and a job has been scheduled. Any crash would cause the in-flight transaction to not commit and to roll back. The job executor will restart processing from the last commit point when it detects a due job.