How to implement SWF exponential retries using the AWS SDK

I'm trying to implement a JRuby SWF activity worker using AWS SDK v2.
I cannot use the aws-flow-ruby framework since it's not compatible with JRuby (it relies on forking), so I wrote a worker that uses threading.
https://github.com/djpate/jflow if people are interested.
Anyway, the framework implements retries, and it seems that it actually schedules the same activity again later if an activity fails.
I've looked everywhere in the AWS docs and cannot find how to send that signal back to SWF using the SDK: http://docs.aws.amazon.com/sdkforruby/api/Aws/SWF/Client.html
Anyone know where I should look?

From the question, I believe you are somewhat confused about what SWF is and how it works.
Activities don't run and are not retried in isolation. Everything happens in the context of a workflow. The workflow definition tells you when to retry and how to behave if activities fail, time out, etc.
The worker that processes the workflow definition and schedules the next thing that needs to happen is referred to as a decider (you will see decider and workflow used interchangeably). It's called a decider because, based on the current state, it makes the decision on what the next activity to be scheduled is. The decider normally takes the workflow history as input when making this decision.
In Flow, for example, the retry is encoded in the workflow logic. Basically, if the activity fails, you can just schedule it again.
So to finally answer your question: if your goal is only to implement the activity workers, you don't need to implement any retry logic, as that happens at the decider level. You should make sure that the activities are compatible with the decider (the history and the input/output conventions must match).
If your goal is to implement your own framework on top of SWF, you need to actually do the hard work of making the decider handle retries.
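For illustration, here is roughly what that decider-level retry looks like at the raw SDK level. This is a sketch in Python with boto3 for brevity, but the Ruby SDK v2 client you linked exposes the same poll_for_decision_task / respond_decision_task_completed operations; the domain, task list and activity names here are made up:

import boto3

swf = boto3.client('swf')

def decide_once():
    # Long-poll for a decision task (returns an empty taskToken on timeout).
    task = swf.poll_for_decision_task(
        domain='my-domain',
        taskList={'name': 'default_tasks'})
    if not task.get('taskToken'):
        return

    decisions = []
    for event in task['events']:
        if event['eventType'] == 'ActivityTaskFailed':
            # A "retry" is nothing more than a decision to schedule the
            # same activity again (activityId must be unique per attempt).
            decisions.append({
                'decisionType': 'ScheduleActivityTask',
                'scheduleActivityTaskDecisionAttributes': {
                    'activityType': {'name': 'MyActivity', 'version': '1.0'},
                    'activityId': 'my-activity-attempt-2',
                    'taskList': {'name': 'default_tasks'}}})

    swf.respond_decision_task_completed(
        taskToken=task['taskToken'], decisions=decisions)

For the exponential part, the usual trick is to emit a StartTimer decision with a growing timeout and only schedule the activity again once the matching TimerFired event shows up in the history.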

How to add delay or Thread.sleep() in script task or how to delay the http task in flowable?

I am running the flowable Maven dependency as a Spring Boot project (this project has the flowable Maven dependency and the BPMN model alone).
There is another micro-service (a wrapper service) that accesses the Flowable REST APIs to initiate the process and update the tasks.
I am running an HTTP task in a loop, repeatedly checking a count. If the count is satisfied, I end the process; otherwise it loops back around the HTTP task. The catch is that I cannot determine when the count will be met (it might even take days).
Here I don't have the option of using a Java Service Task.
How can I handle this scenario in the BPMN model? Or is there another approach to follow? Please advise.
You can let your check complete, then test with an XOR gateway whether the count is reached. If yes, you continue with the regular process. If not, you continue to an intermediate timer event on which you define a wait time. After the specified time the token continues and you loop back into the checking service task.
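For reference, the intermediate timer catch event in the BPMN XML looks roughly like this (the id and the one-hour duration are placeholders; Flowable also accepts timeDate and timeCycle):

<intermediateCatchEvent id="waitBeforeNextCheck">
  <timerEventDefinition>
    <!-- ISO 8601 duration: wait one hour before looping back -->
    <timeDuration>PT1H</timeDuration>
  </timerEventDefinition>
</intermediateCatchEvent>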
Only use this approach if the number of loops will be small. It is not a good pattern if the loop executes every few seconds, potentially over days, as this creates a large instance tree and a lot of audit information in the DB.
In such a case you can work with an external job scheduler such as Quartz and an asynchronous integration pattern.
Also see:
https://www.flowable.com/open-source/docs/bpmn/ch07b-BPMN-Constructs/#timer-intermediate-catching-event
or
https://docs.camunda.io/docs/next/components/modeler/bpmn/timer-events/

How to complete a service task using camunda rest api

I am using Camunda workflows to automate various processes. I have come across a scenario where the process is not moving from a service task. Usually, we call the task/{taskid}/complete to complete the task, but since the process is stuck on a service task, I am not able to complete that task. Can anybody help me find a way to complete the service task?
You are using a service task. That basically means "a machine should do something". The "normal" implementation is to provide code (a JavaDelegate or a connector endpoint) that is called by the process engine to execute this task.
The alternative is to use the "external task" pattern. Think of external tasks as "user tasks for computers". So the process waits, tells subscribed clients that a job is to be done, and waits for their completion.
I suppose your process uses the second option? (You can check in the modeler under "Implementation".) If so, completion can be done through the external task API; see the docs:
/external-task/{id}/complete
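For illustration, here is a rough sketch of an external task client in Python; the base URL, worker id and topic name are made up, but fetchAndLock and complete are the two REST calls involved:

import requests

BASE = 'http://localhost:8080/engine-rest'

# 1. Fetch and lock a job on our topic so no other worker grabs it.
tasks = requests.post(BASE + '/external-task/fetchAndLock', json={
    'workerId': 'my-worker',
    'maxTasks': 1,
    'topics': [{'topicName': 'send-email', 'lockDuration': 10000}],
}).json()

# 2. Do the actual work, then report completion back to the engine.
for task in tasks:
    requests.post(BASE + '/external-task/%s/complete' % task['id'],
                  json={'workerId': 'my-worker'})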
If it is a connector, then when checking the log you will likely see that retries occurred and that the transaction rolled back. After addressing the underlying issue, the service task (email) should be executed without explicitly triggering it, and the following user task (Approval) should be created.

Is there an AWS / Pagerduty service that will alert me if it's NOT notified

We've got a little Java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith: it fires up (Fargate) tasks in Docker containers. We've got a task that runs every hour, and it's quite important to us. I want to know if it crashes or fails to run for any reason (e.g. the Java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then, if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to reinvent the wheel trying to build it myself. I guess my question is, what's it called? And where can I go to get that kind of thing? (We're using AWS, obviously, and we've got a PagerDuty account.)
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an HTTP-based service that will read that file and check that the timestamp is valid, i.e. has been updated in the last hour. This could be a simple PHP or Node.js script, exposed to the public web, e.g. https://example.com/heartbeat.php. The script returns an HTTP response code of 200 if the timestamp file is present and valid, or a 500 if not. Then we use StatusCake to monitor the URL and notify us via its PagerDuty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
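The same check in Python would look something like this (a sketch only; the bucket and key names are made up, and this assumes the timestamp file lives in S3):

from datetime import datetime, timedelta, timezone

import boto3
from flask import Flask

app = Flask(__name__)
s3 = boto3.client('s3')

@app.route('/heartbeat')
def heartbeat():
    try:
        head = s3.head_object(Bucket='my-monitoring-bucket',
                              Key='hourly-task.timestamp')
        age = datetime.now(timezone.utc) - head['LastModified']
        if age < timedelta(hours=1):
            return 'OK: last run %s ago' % age, 200
        return 'STALE: last run %s ago' % age, 500
    except Exception as exc:
        # A missing file or an S3 error also counts as a failed heartbeat.
        return 'ERROR: %s' % exc, 500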
This may seem tedious, but it is foolproof. Any failure anywhere along the line is immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in the same way. We've learned the hard way that critical cron-type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer-critical. 24x7x365 monitoring of these types of tasks is necessary, and it helps us sleep better at night.
Note: we always have a daily system test event that triggers a PagerDuty notification at 9am each day. For the truly paranoid, this assures that PagerDuty itself has not failed in some way, e.g. through misconfiguration. Our support team knows that if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to acknowledge the incident as per SOP. If they do not acknowledge it, it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to ensure you have a robust monitoring infrastructure.
Opsgenie has a heartbeat service, which is basically a watchdog timer. You can configure it to call you if you don't ping them within x number of minutes.
Unfortunately, I would not recommend them. I have been using them for 4 years, and they have changed their account system twice and silently left my paid account orphaned. I have to find a new vendor as soon as I have some free time.
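If you do go this route, the ping itself is a single HTTP call at the end of every successful run; something like the following, per Opsgenie's heartbeat API (the heartbeat name and API key are placeholders):

import requests

requests.get(
    'https://api.opsgenie.com/v2/heartbeats/hourly-task/ping',
    headers={'Authorization': 'GenieKey YOUR-API-KEY'},
    timeout=10)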

Is there a way to finish manual task synchronously (without waiting for async result) if some precondition is satisfied?

I am using AWS SWF and the Flow framework. I wanted to make my activities idempotent so that a workflow can be restarted from the beginning after any failure. Many of the activities are manual tasks (@ManualActivityCompletion) which need to be completed asynchronously.
Is there a way to finish manual tasks like normal tasks if I know they are already complete? This way a new manual task will not be scheduled every time the workflow is retried.
Or, is there a way to retry a workflow so that it starts from the point where it failed?
Currently there is no way to override activity completion behavior at runtime. The workaround is to complete the activity using ManualActivityCompletionClient from within the activity implementation.
There is no supported way to retry a workflow so that it starts from the point of failure.
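For what it's worth, at the raw SDK level that workaround amounts to completing the activity with its own task token when the precondition already holds, which is the same RespondActivityTaskCompleted call that ManualActivityCompletionClient wraps. A rough Python/boto3 sketch (the precondition check is made up):

import boto3

swf = boto3.client('swf')

def handle_activity(task_token, input_data):
    if already_done(input_data):  # hypothetical idempotency check
        # Complete immediately instead of waiting for manual completion.
        swf.respond_activity_task_completed(
            taskToken=task_token, result='completed earlier')
    # otherwise, hand the token off for out-of-band manual completion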

In Amazon SWF, can I abuse a Decision task to actually perform the work

I need Amazon SWF to distribute some work, make sure it's done asynchronously, make sure it's stored in a reliable way, and make sure it's automatically restarted. However, the workflow logic I need is extremely simple: it's just to get a single task executed.
I implemented it now the way it's supposed to be done:
Request workflow execution
The decider finds out about it and schedules an activity
A worker finds out about the activity request, performs the work and returns the results
The decider notices the result and copies it over into a workflow completion
It seems to me that I can just have the decider do the work, as it were, and complete the workflow execution immediately. That would do away with a lot of code. (The activity might also fail, time out, etc., all things that I currently need to cater for.)
So back to my question: can I have a decider that performs the work itself and completes the 'workflow' immediately?
Yes. Actually, I think you came up with an interesting use case: using a minimal workflow as a centralized locking mechanism for one-off actions in a distributed system, such as cron jobs executed from a single host in a fleet of many (the hosts first undergo an election, and whichever wins the lock gets to execute the action). The same can be achieved with Amazon SWF and a minimal amount of code.
A small Python example, using boto.swf (use 1. from this post to set up the domain):
To code the decider:
# MyDecider.py
import boto.swf.layer2 as swf

class OneShotDecider(swf.Decider):

    domain = 'stackoverflow'
    task_list = 'default_tasks'
    version = '1.0'

    def run(self):
        history = self.poll()
        if 'events' in history:
            # We got a decision task: do the work right here,
            # then close the workflow execution immediately.
            decisions = swf.Layer1Decisions()
            print('got the decision task, doing the work')
            decisions.complete_workflow_execution()
            self.complete(decisions=decisions)
            return False  # stop the polling loop
        return True
To start the decider:
$ ipython -i MyDecider.py
In [1]: while OneShotDecider().run(): print('polling SWF for decision tasks')
Finally, to start the workflow:
$ ipython
In [1]: import boto.swf.layer2 as swf
In [2]: wf_type = swf.WorkflowType(domain='stackoverflow', name='MyWorkflow', version='1.0', task_list='default_tasks')
In [3]: wf_type.start()
Out[3]: <WorkflowExecution 'MyWorkflow-1.0' at 0x32e2a10>
Back in the decider window, you'll see something like:
polling SWF for decision tasks
polling SWF for decision tasks
got the decision task, doing the work
If your workflow is likely to evolve its business logic or grow in the number of activities, it's probably best to stick to the standard way of having Deciders doing the business logic and Workers solving the tasks.
While yes, you can do this (as pointed out by the other answer), there are some things to consider before doing so:
Why are you using SWF to execute this task? Why bother setting it up as a workflow and paying for "StartWorkflow" executions if you can get the same benefit by just invoking your code more directly? If you need to track execution submissions and completions, you can just use an SQS queue for this and get the same results more cheaply (a minimal sketch follows below).
Your workflows might be extremely simple right now, but they often can and do evolve to be more complex over time. Designing it right from the start can save time in the long run. Do you want future developers working on your code thinking that they should just add more logic to the workflow? Will they know to look up how to use activities, or will they just follow the existing pattern you've started with? (Hint: they'll likely copy your pattern; developers are lazy :))
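To illustrate the SQS point above, tracking submissions and completions can be as small as this (a sketch; the queue URL and do_work are made up):

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/one-off-tasks'

# Submit a task.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"job": "hourly-report"}')

# Worker side: receive, do the work, delete the message on success.
msgs = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
for msg in msgs.get('Messages', []):
    do_work(msg['Body'])  # hypothetical worker function
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])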