Is there a way to finish manual task synchronously (without waiting for async result) if some precondition is satisfied? - amazon-web-services

I am using AWS SWF and flow framework. I wanted to make my activities idempotent so that a workflow can be restarted from the beginning after any failure. Many of the activities are manual tasks (#ManualActivityCompletion) which need to be completed asynchronously.
Is there a way to finish manual tasks like normal tasks if I know that it is already complete? This way a new manual task will not be scheduled everytime the workflow is retried.
Or, is there a way to retry a workflow so that it starts from the point it failed?

Currently there is no way to override activity completion behavior at runtime. The work around is to complete activity using ManualActivityCompletionClient from within activity implementation.
There is no supported way to retry workflow to start from the point of failure.

Related

When running GitHub actions with a concurrency restriction, can I get workflow runs enqueued rather than cancelled?

The documentation of GitHub actions says:
You can use jobs.<job_id>.concurrency to ensure that only a single job or workflow using the same concurrency group will run at a time.
...
When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.
It is annoying that previously pending jobs get cancelled. Evidently the orchestration logic can only maintain a tiny "queue" of one (1) pending job.
I would like to be able to have multiple jobs enqueued. I.e., if I trigger 5 jobs in rapid succession, and they all belong to the same concurrency group, then the first one starts to run immediately (when a runner is availble) and the next 4 get enqueued and wait for their turn to run, one at a time.
Is there any way to achieve this? Or will I need to request this as a feature from GitHub?

What happens when we trigger the SWF Flows #Execute method multiple times?

We have a usecase where we start a workflow (by invoking #Execute method) and the we schedule a timer for a subsequent activity. Now, this triggering of workflow is based on API call which can be triggered multiple times by a client.
Wanted to know how SWF flow handled the multiple invocations of #Execute method.
Does it create multiple executions ?
or would there be multiple timer clocks scheduled for same workflow execution ?
SWF allows only one open workflow execution per ID. So if the workflow is still running calling the Execute method again is going to return WorkflowExecutionAlreadyStartedFault.
Note that if a workflow is completed the new workflow is going to start even for the same ID.
The temporal.io which is an open source version of SWF has an additional WorkflowIdReusePolicy which specifies what should be done if there are already completed workflows.

How to implement SWF exponential retries using the aws sdk

I'm trying to implement a jruby SWF activity worker using AWS SDK v2.
I cannot use the aws-flow-ruby framework since it's not compatible with jruby(forking), so I wrote a worker that uses threading.
https://github.com/djpate/jflow if people are interested.
Anyway, in the framework they implement retries and It seems that it actually schedules the same activity later if an activity failed.
I found everywhere in the AWS docs and cannot find how to send that signal back to SWF using the SDK http://docs.aws.amazon.com/sdkforruby/api/Aws/SWF/Client.html
Anyone know where I should look?
From the question, I believe you are somewhat confused about what SWF is / how it works.
Activities don't run and are not retried in isolation. Everything happens in the context of a workflow. The workflow definition tell you when to retry and how to behave if activities fail/timeout etc.
The worker that processes the workflow definition and schedules the next thing that needs to happen is referred to as a decider. (you will see decider and workflow used interchangeably). It's called a decider because based on the current state it makes the decision on what the next activity that needs to be scheduled is. The decider normally takes the workflow history as input when making this input.
In Flow for example, the retry is encoded in the workflow logic. Basically if the activity fails you can just schedule it.
So to finally answer your question: if your target is to only implement the activity workers you don't need to implement any retry logic as that happens at the decider level. You should make sure that the activities are compatible with the decider (you need to make sure the history and the input/output convention are the same).
If your target is to implement your own framework on top of SWF you need to actually do the hard work needed to make the decider work.

Approach to crashed workers in amazon swf

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after specified timeout. Make sure that workflow type defaultTaskStartToCloseTimeout is not set too high. If this crash is not related to code correctness then rescheduled task is processed and workflow execution continues normally.
Workflow worker doesn't crash but workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows.
Workflow worker doesn't crash but a decision task cannot complete as RespondDecisionTaskCompleted fails due to a bug in the Flow framework. As from SWF point of view task is never completed it at some point is marked as timed out and rescheduled. As bug is still present a new task is again never completes and rescheduled, and so on. The workflow execution that is experiencing such issue has a history with a tail that consists from repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit then the best way to catch this issue is to set reasonable executionStartToCloseTimeout and look for timed out workflow executions. If the decision task timeout is set too low such workflows can also hit the limit on history size before the execution timeout.
All swf metrics are not published to cloud watch. So all completed and failed workflows will send the metrics to cloudwatch where you can create alarms to send you notifications when any workflow fails.

How to kill /re-start a long running task

Is there a way to kill / re-start a long running task in AWS SWF? Sometimes some of our tasks run for a longer duration and we would like to manually kill a certain task (either via UI or programmatically) and re-start the task if possible. How to achieve this?
Console is option to manually kill workflow.
You can also set timeouts to whole workflow execution time or to individual activities. This can be set when you register your activity or when you start your activity (defaultTaskStartToCloseTimeoutSecond).
It's not clear what language you're using.
If you're using java, then you should look into Exponential Retry in Flow Framework. This make SDK restart your activity if it fails.
Long running activity is expected to heartbeat using RecordActivityTaskHeartbeat. It leads to timeout failure after short hearbeat interval instead of long task execution timeout if the activity process hangs or crashes.
The workflow code (decider) can always request activity cancellation through RequestCancelActivityTask decision. The cancellation request is returned as output of the RecordActivityTaskHeartbeat call. Activity implementation should cancel itself and report back to the service using RespondActivityTaskCanceled API call.
See Error Handling section of AWS Flow Framework Developer Guide for the AWS Flow Framework way of cancelling activities.
Sometimes activity implementation cannot support heartbeating and self cancellation. The solution is to execute another kill activity that terminates the first activity execution. For example under Unix such kill activity could emit "kill -9" command for the process that implements the first one.