How do I get parallel jobs working on failure - scheduling

I have a set of parallel jobs where one fails and the other succeeds.
When I try "resume execution"/"retry failed nodes", it triggers both jobs again.
Is there any setting in Rundeck which can trigger only the failed job rather than rerunning the entire group? Or is this a bug?

The reason is that the job reference steps run inside a single job; by design, Rundeck considers that one parent job execution, not individual executions of the child (parallel) jobs. If you want to avoid this, run those jobs individually, in a job that calls each of them via the rd CLI or the Rundeck API from an inline-script step.
That way you can retry only the failed execution.
As for resuming from a failed step, you can use the Job Resume Plugin (Rundeck Enterprise only). The plugin allows a job to resume at the failed step.
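A minimal sketch of that inline-script approach, assuming the rd CLI is installed and configured and that rd run --follow returns the execution's result as its exit status; the project and job names are placeholders:

#!/bin/bash
# Inline-script step: start each child job as its own execution via the rd CLI,
# so a failed child can be retried on its own. All names below are placeholders.
set -u

run_job() {
  # --follow waits for the execution and (assumed) exits non-zero on failure.
  rd run -p myproject -j "$1" --follow
}

run_job "parallel/jobA" & pid_a=$!
run_job "parallel/jobB" & pid_b=$!

fail=0
wait "$pid_a" || { echo "jobA failed" >&2; fail=1; }
wait "$pid_b" || { echo "jobB failed" >&2; fail=1; }
exit $fail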

Related

When running GitHub Actions with a concurrency restriction, can I get workflow runs enqueued rather than cancelled?

The documentation of GitHub Actions says:
You can use jobs.<job_id>.concurrency to ensure that only a single job or workflow using the same concurrency group will run at a time.
...
When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.
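For reference, a minimal sketch of the kind of configuration being described; the workflow file, job id and group name are placeholders:

# .github/workflows/deploy.yml (file name and job id are placeholders)
on: push
jobs:
  deploy:
    runs-on: ubuntu-latest
    # Only one run of this group at a time; a newly queued run replaces
    # (cancels) any previously pending run instead of queueing behind it.
    concurrency:
      group: deploy-${{ github.ref }}
      cancel-in-progress: false
    steps:
      - run: echo "deploying"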
It is annoying that previously pending jobs get cancelled. Evidently the orchestration logic can only maintain a tiny "queue" of one (1) pending job.
I would like to be able to have multiple jobs enqueued, i.e., if I trigger 5 jobs in rapid succession and they all belong to the same concurrency group, then the first one starts to run immediately (when a runner is available) and the next 4 get enqueued and wait for their turn to run, one at a time.
Is there any way to achieve this? Or will I need to request this as a feature from GitHub?

How to restart an AWS Data Pipeline

I have a scheduled AWS Data Pipeline that failed partway through its execution. I fixed the problem without modifying the Pipeline in any way (changed a script in S3). However, there seems to be no good way to restart the Pipeline from the beginning.
I tried Deactivating/Reactivating the Pipeline, but the previously "FINISHED" nodes were not restarted. This is expected; according to the docs, this only pauses and un-pauses execution of the Pipeline, which is not what we want.
I tried Rerunning one of the nodes (call it x) individually, but it did not respect dependencies: none of the nodes x depends on reran, nor did the nodes that depend on x.
I tried activating it from a time in the past, but received the error: startTimestamp should be later than any Schedule StartDateTime in the pipeline (Service: DataPipeline; Status Code: 400; Error Code: InvalidRequestException; Request ID: <SANITIZED>).
I would rather not change the Schedule node, since I want the Pipeline to continue to respect it; I only need this one manual execution. How can I restart the Pipeline from the beginning, once?
So far, the best way to accomplish this that I've found is to Clone the Pipeline, make it On-Demand (instead of Scheduled) and activate that one. This new Pipeline will activate and run immediately. This seems cumbersome, however; I'd be happy to hear a better way.
The ActivatePipeline API has a startTimestamp parameter with which you can restart execution from any previous time interval. See http://docs.aws.amazon.com/datapipeline/latest/APIReference/API_ActivatePipeline.html
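A hedged sketch of that call using the AWS CLI; the pipeline id and timestamp are placeholders, and note the error above about startTimestamp having to be later than the Schedule's StartDateTime:

# Re-activate the pipeline from a chosen point in time.
# Pipeline id and timestamp are placeholders.
aws datapipeline activate-pipeline \
    --pipeline-id df-0123456789ABCDEF \
    --start-timestamp 2016-05-01T00:00:00Z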

Approach to crashed workers in Amazon SWF

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after the specified timeout. Make sure that the workflow type's defaultTaskStartToCloseTimeout is not set too high. If the crash is not related to code correctness, the rescheduled task is processed and the workflow execution continues normally.
Workflow worker doesn't crash, but the workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows.
Workflow worker doesn't crash, but a decision task cannot complete because RespondDecisionTaskCompleted fails due to a bug in the Flow framework. Since, from SWF's point of view, the task is never completed, it is eventually marked as timed out and rescheduled. Because the bug is still present, the new task again never completes and is rescheduled, and so on. A workflow execution experiencing this issue has a history whose tail consists of repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit, the best way to catch this issue is to set a reasonable executionStartToCloseTimeout and look for timed-out workflow executions. If the decision task timeout is set too low, such workflows can also hit the limit on history size before the execution timeout.
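A hedged sketch of the ListClosedWorkflowExecutions check using the AWS CLI; the domain name and time window are placeholders:

# List executions that closed with status FAILED over the last day.
# Domain name and time window are placeholders.
OLDEST=$(date -u -d '1 day ago' +%s)
aws swf list-closed-workflow-executions \
    --domain my-swf-domain \
    --start-time-filter oldestDate=$OLDEST \
    --close-status-filter status=FAILED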
All SWF metrics are now published to CloudWatch, so completed and failed workflows send their metrics to CloudWatch, where you can create alarms that notify you when any workflow fails.
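A hedged sketch of such an alarm with the AWS CLI; the AWS/SWF namespace, the WorkflowsFailed metric and its dimensions should be verified against your account, and the domain, workflow type and SNS topic below are placeholders:

# Notify an SNS topic when any execution of a workflow type fails.
aws cloudwatch put-metric-alarm \
    --alarm-name swf-workflow-failed \
    --namespace AWS/SWF \
    --metric-name WorkflowsFailed \
    --dimensions Name=Domain,Value=my-swf-domain \
                 Name=WorkflowTypeName,Value=MyWorkflow.execute \
                 Name=WorkflowTypeVersion,Value=1.0 \
    --statistic Sum --period 300 --evaluation-periods 1 \
    --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts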

Workflow handling on Camunda engine restart

Scenario: a few jobs are currently running. If a cluster reboot happens in the middle of job execution, I should be able to observe that process instance execution continues, with the proper state, after the reboot.
Will Camunda take care of preserving the process instance state, using some kind of checkpoint, and resume automatically from where it halted?
If you have reached at least one asynchronous continuation (e.g. check the "async after" property, or set one on the start event), then the process instance has been persisted to the database and a job has been scheduled. Any crash means the following transaction does not commit and is rolled back. The job executor will restart processing from the last commit point when it detects a due job.
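A minimal sketch of what such a save point looks like in the BPMN XML; the element id, name and delegate are placeholders, and camunda:asyncBefore / camunda:asyncAfter are the attributes behind the modeler's "async before"/"async after" properties:

<!-- Element id, name and delegate expression are placeholders. -->
<bpmn:serviceTask id="longRunningStep" name="Long running step"
                  camunda:asyncBefore="true"
                  camunda:delegateExpression="${myDelegate}">
  <!-- asyncBefore persists the instance and creates a job before this task runs,
       so after a crash the job executor resumes from this commit point. -->
</bpmn:serviceTask>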

Concurrency in running Oozie workflow: how many and how to throttle

Let us say we have an Oozie workflow that has a copy action node and then a Shell action node. Can I start multiple instances of such an Oozie workflow and run them in parallel? What if the concurrency spikes to the thousands, or even millions? Is that possible; does Oozie even support that level of concurrency?
If not, then we will have to consider throttling and enforce a cap on how many concurrent Oozie workflow instances there can be. We'd prefer to throttle this on the server/Oozie side (ideally with out-of-the-box Oozie functionality), not on the client/caller side. For example, we have a huge launch script with lines like the following. We want to run it in a single shot and then let Oozie figure out how to throttle all these instances on its own. We don't want to split it into multiple smaller chunks and kick off one chunk at a time.
oozie job -oozie http://myhost.com:11000/oozie -config job1.properties -run
oozie job -oozie http://myhost.com:11000/oozie -config job2.properties -run
......
oozie job -oozie http://myhost.com:11000/oozie -config job1000000.properties -run
You will not be able to have a higher Oozie workflow concurrency than the number of map slots on your cluster because a Shell action is run by a one-mapper-zero-reducer MR job.
If you have many instances of a workflow to get through then the best mechanism is to use an Oozie coordinator. This will keep track of the completion of each instance and easily manage concurrency. An Oozie coordinator has a <concurrency> tag that controls how many instances of the workflow will execute in parallel, and a <throttle> tag that controls how many instances are brought into a waiting state before there is free concurrency for one to begin.
See: https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html#a6.3._Synchronous_Coordinator_Application_Definition
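A hedged sketch of such a coordinator; the app name, paths, dates and limits are placeholders, and the schema version should be matched to your Oozie release (see the spec linked above):

<!-- App name, paths, dates and limits are placeholders. -->
<coordinator-app name="throttled-launcher" frequency="${coord:minutes(5)}"
                 start="2016-01-01T00:00Z" end="2016-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <controls>
    <!-- At most 10 workflow instances running at once... -->
    <concurrency>10</concurrency>
    <!-- ...and at most 50 materialized instances waiting to run. -->
    <throttle>50</throttle>
  </controls>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/my-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>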
Note that the default behavior of an Oozie coordinator is to wait 5 minutes between each polling of whether a new instance should be created. If your workflows run in less than 5 minutes then the process will bottleneck on this interval. You can change this with the oozie.service.CoordMaterializeTriggerService.lookup.interval property (in seconds) in your oozie-site.xml file.
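For reference, that property goes into oozie-site.xml like this (the 60-second value is only an example):

<!-- Poll for new coordinator materializations every 60 seconds
     instead of the default 300. -->
<property>
  <name>oozie.service.CoordMaterializeTriggerService.lookup.interval</name>
  <value>60</value>
</property>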