Using Timers/Signals to allow human intervention in AWS SWF Workflow - amazon-web-services

Here's the scenario. A user uploads an Excel file and this kicks off a workflow which validates the file, transforms it into a few different files, then performs an update to a database based on the transforms. After the uploads, the results need to be reviewed by team member before the flow can continue.
I'm using Ruby and have discovered that Signals and Timers are the way to achieve this in SWF. However, the Ruby examples are lacking or non-existent and I need a little help understanding how this would work using Ruby.
Ny understanding so far is that a Timer activity is scheduled which basically pauses the flow until either the timer expires (at which point I could cancel the workflow or email the staff and set another timer) or a signal is sent to the workflow to start the next step. The Decider would handle the signal and then kick off the appropriate activity.
Any thoughts or direction to other sources would be much appreciated.
Thanks,
Thomas

It's somewhat difficult to provide an "answer", given you didn't really ask a specific question. I'm in agreement with you that using a Timer and Signals is what you want.
You don't specify how the team gets notified about the review. I'll assume that you notify them by email and direct them to some website where they can review the changes, and then click on a link to either Approve or Don't Approve. Clicking the link to Approve will send a request to a web server that will "signal" SWF that the review has been approved. Clicking the link to Don't Approve will "signal" SWF that the review has not been approved. You mention that you want to renotify the team (or perhaps escalate to the manager) if no one has taken action on the review. Let's say this renotification happens after 48 hours. After the renotication, you grant them another 72 hours before assumming Don't Approve.
Here's how your workflow looks like to me:
User uploads file and kicks off a workflow
Decider Task schedules "TransformActivity"
TransformActivity runs, transforms the data into different files, and completes successfully
Decider Task schedules "UpdateDatabaseActivity"
UpdateDatabaseActivity runs, updates the database, and completes successfully
Decider Task schedules "EmailTeamActivity"
EmailTeamActivity runs, emails the team, and completes successfully
Decider Task schedules a Timer for 48 hours.
If a signal indicating Approve or Don't Approve is received within 48 hours:
Decider Task schedules the "RecordFinalDecisionActivity"
RecordFinalDecisionActivity will run, record the Approve (or Don't Approve) into the database, and complete successfully.
Decider Task will then close the workflow because it's done.
If no signal is received and the timer fires (after 48 hours):
Decider Task schedules the "EmailTeamAndManagerActivity"
EmailTeamAndManagerActivity runs, emails the team and manager, and completes successfully.
Decider Task schedules another timer for 72 hours.
If a signal indicating Approve or Don't Approve is received within the additional 72 hours given:
Repeat the same logic as the section "If a signal indicating Approve or Don't Approve is received within 48 hours".
If no signal is received and the timer fires (after the additional 72 hours):
At this point, the workflow can assume it was a Don't Approve, schedule the "RecordFinalDecisionActivity" and close the workflow once that activity completes.
The reason why you don't want to have a "review" activity is because that task gets scheduled and then some activity worker needs to reply success. How would that work? When someone clicks the Approve or Don't Approve link, the request to the webserver would have to pull down the activity from the task list. However, if the task list has multiple activities, SWF just gives out any one of them. It might not get the right one. Now, you could argue that you could schedule the different reviews across different task lists, but that's just cumbersome and tedious.
Signals are done to indicate an "external" event, which this very much is. The SWF documentation on Signals does a great job on talking about Signals. Here's the SWF documentation on how to use Timers and Signals. As for the particulars on how to use SWF and Ruby, I can't really help you there. I've only used SWF with Java by using the AWS Flow Framework.

user upload excel file, does "StartWorkflowExecution", that queues a decision task
decision worker notice flow is new / "stage one", it schedules "transform file" activity task
activity worker picks up task, and does the "transform file" activity, when done does "RespondActivityTaskCompleted" with a result of "transformations done", that queues a decision task
decision worker picks up decision task, notices the transformations are done and schedule a new activity task
activity worker picks up activity task, notices it's for a team member (according to the instructions given by the decision worker when scheduling the activity task), team member gets notified, somehow perform his action, then somehow notifies the activity worker which will reply "RespondActivityTaskCompleted"
I don't see the need for a Timer or a Signal, it's just plain flow. Those two concepts are useful if you want recurring events, timeouts, and/or interrupting the flow.
Please note that you can differentiate activity workers by using task lists (for example activity workers for automated work vs activity workers for human participants, whatever).

Related

When running GitHub actions with a concurrency restriction, can I get workflow runs enqueued rather than cancelled?

The documentation of GitHub actions says:
You can use jobs.<job_id>.concurrency to ensure that only a single job or workflow using the same concurrency group will run at a time.
...
When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.
It is annoying that previously pending jobs get cancelled. Evidently the orchestration logic can only maintain a tiny "queue" of one (1) pending job.
I would like to be able to have multiple jobs enqueued. I.e., if I trigger 5 jobs in rapid succession, and they all belong to the same concurrency group, then the first one starts to run immediately (when a runner is availble) and the next 4 get enqueued and wait for their turn to run, one at a time.
Is there any way to achieve this? Or will I need to request this as a feature from GitHub?

How to implement SWF exponential retries using the aws sdk

I'm trying to implement a jruby SWF activity worker using AWS SDK v2.
I cannot use the aws-flow-ruby framework since it's not compatible with jruby(forking), so I wrote a worker that uses threading.
https://github.com/djpate/jflow if people are interested.
Anyway, in the framework they implement retries and It seems that it actually schedules the same activity later if an activity failed.
I found everywhere in the AWS docs and cannot find how to send that signal back to SWF using the SDK http://docs.aws.amazon.com/sdkforruby/api/Aws/SWF/Client.html
Anyone know where I should look?
From the question, I believe you are somewhat confused about what SWF is / how it works.
Activities don't run and are not retried in isolation. Everything happens in the context of a workflow. The workflow definition tell you when to retry and how to behave if activities fail/timeout etc.
The worker that processes the workflow definition and schedules the next thing that needs to happen is referred to as a decider. (you will see decider and workflow used interchangeably). It's called a decider because based on the current state it makes the decision on what the next activity that needs to be scheduled is. The decider normally takes the workflow history as input when making this input.
In Flow for example, the retry is encoded in the workflow logic. Basically if the activity fails you can just schedule it.
So to finally answer your question: if your target is to only implement the activity workers you don't need to implement any retry logic as that happens at the decider level. You should make sure that the activities are compatible with the decider (you need to make sure the history and the input/output convention are the same).
If your target is to implement your own framework on top of SWF you need to actually do the hard work needed to make the decider work.

Approach to crashed workers in amazon swf

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after specified timeout. Make sure that workflow type defaultTaskStartToCloseTimeout is not set too high. If this crash is not related to code correctness then rescheduled task is processed and workflow execution continues normally.
Workflow worker doesn't crash but workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows.
Workflow worker doesn't crash but a decision task cannot complete as RespondDecisionTaskCompleted fails due to a bug in the Flow framework. As from SWF point of view task is never completed it at some point is marked as timed out and rescheduled. As bug is still present a new task is again never completes and rescheduled, and so on. The workflow execution that is experiencing such issue has a history with a tail that consists from repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit then the best way to catch this issue is to set reasonable executionStartToCloseTimeout and look for timed out workflow executions. If the decision task timeout is set too low such workflows can also hit the limit on history size before the execution timeout.
All swf metrics are not published to cloud watch. So all completed and failed workflows will send the metrics to cloudwatch where you can create alarms to send you notifications when any workflow fails.

Does AWS SWF let work flow to sleep some time?

Every time a customer completes a transaction a reminder workflow starts for that customer which tries to remind a customer about few actions that he/she has to perform. There are points in the flow where I know for sure that there is no task to be performed. So in this case I want the workflow to go to sleep for some time and come back to life later. I want this sleep feature to avoid database call as my decider does one database query every time it gets a task.
I have gone through the AWS documentation here . But found nothing there (Please point me to document if the feature exists). Does AWS-SWF provide such a feature. If it does not provide a feature of this type then what is smart and clean way of doing this.
A small example of flow I want to create :
1. End of transaction initiates a "simple workflow"
2. Decider gets a task. Decider decides to give it to a Customer
Reminder activity worker or PUT IT TO SLEEP.
3. The decider keeps poling but never gets the workflow till the sleep
time of work flow is over.
4. The sleep time is over so SWF starts giving it the decider which has
been polling all along.
Please tell me if you need any more clarification on this.
Use StartTimerDecision to create a timer.
Refer to the timer documentation.
http://docs.aws.amazon.com/amazonswf/latest/developerguide/swf-dg-timers.html

How to kill /re-start a long running task

Is there a way to kill / re-start a long running task in AWS SWF? Sometimes some of our tasks run for a longer duration and we would like to manually kill a certain task (either via UI or programmatically) and re-start the task if possible. How to achieve this?
Console is option to manually kill workflow.
You can also set timeouts to whole workflow execution time or to individual activities. This can be set when you register your activity or when you start your activity (defaultTaskStartToCloseTimeoutSecond).
It's not clear what language you're using.
If you're using java, then you should look into Exponential Retry in Flow Framework. This make SDK restart your activity if it fails.
Long running activity is expected to heartbeat using RecordActivityTaskHeartbeat. It leads to timeout failure after short hearbeat interval instead of long task execution timeout if the activity process hangs or crashes.
The workflow code (decider) can always request activity cancellation through RequestCancelActivityTask decision. The cancellation request is returned as output of the RecordActivityTaskHeartbeat call. Activity implementation should cancel itself and report back to the service using RespondActivityTaskCanceled API call.
See Error Handling section of AWS Flow Framework Developer Guide for the AWS Flow Framework way of cancelling activities.
Sometimes activity implementation cannot support heartbeating and self cancellation. The solution is to execute another kill activity that terminates the first activity execution. For example under Unix such kill activity could emit "kill -9" command for the process that implements the first one.