Does AWS SWF let work flow to sleep some time?

Does AWS SWF let work flow to sleep some time? - amazon-web-services

Every time a customer completes a transaction a reminder workflow starts for that customer which tries to remind a customer about few actions that he/she has to perform. There are points in the flow where I know for sure that there is no task to be performed. So in this case I want the workflow to go to sleep for some time and come back to life later. I want this sleep feature to avoid database call as my decider does one database query every time it gets a task.
I have gone through the AWS documentation here . But found nothing there (Please point me to document if the feature exists). Does AWS-SWF provide such a feature. If it does not provide a feature of this type then what is smart and clean way of doing this.
A small example of flow I want to create :
1. End of transaction initiates a "simple workflow"
2. Decider gets a task. Decider decides to give it to a Customer
Reminder activity worker or PUT IT TO SLEEP.
3. The decider keeps poling but never gets the workflow till the sleep
time of work flow is over.
4. The sleep time is over so SWF starts giving it the decider which has
been polling all along.
Please tell me if you need any more clarification on this.

Use StartTimerDecision to create a timer.

Refer to the timer documentation.
http://docs.aws.amazon.com/amazonswf/latest/developerguide/swf-dg-timers.html

Related

Is there an AWS / Pagerduty service that will alert me if it's NOT notified

We've got a little java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith. it fires up (fargate) tasks in docker containers. We've got a task that runs every hour and it's quite important to us. I want to know if it crashes or fails to run for any reason (eg the java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to re-invent the wheel trying to build it myself. I guess my question is, what's it called? And where can I go to get that kind of thing? (we're using AWS obviously and we've got a pagerDuty account).

We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an http based service that will read that file and calculate if the time stamp is valid ie has been updated in the last hour. This could be a simple php or nodejs script. This process is exposed to the public web eg https://example.com/heartbeat.php. This script returns a http response code of 200 if the timestamp file is present and valid, or a 500 if not. Then we use StatusCake to monitor the url, and notify us via its Pager Duty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
This may seem tedious, but it is foolproof. Any failure anywhere along the line will be immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in same way. We've learned the hard way that critical cron type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer critical. 24x7x365 monitoring of these types of tasks is necessary, and helps us sleep better at night.
Note: We always have a daily system test event that triggers a Pager Duty notification at 9am each day. For the truly paranoid, this assures that pager duty itself has not failed in some way eg misconfiguratiion etc. Our support team knows if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to awknowlege the incident as per SOP. If they do not awknowlege, then it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to insure you have robust monitoring infrastructure.

OpsGene has a heartbeat service which is basically a watch dog timer. You can configure it to call you if you don't ping them in x number of minutes.
Unfortunately I would not recommend them. I have been using them for 4 years and they have changed their account system twice and left my paid account orphaned silently. I have to find a new vendor as soon as I have some free time.

How to implement SWF exponential retries using the aws sdk

I'm trying to implement a jruby SWF activity worker using AWS SDK v2.
I cannot use the aws-flow-ruby framework since it's not compatible with jruby(forking), so I wrote a worker that uses threading.
https://github.com/djpate/jflow if people are interested.
Anyway, in the framework they implement retries and It seems that it actually schedules the same activity later if an activity failed.
I found everywhere in the AWS docs and cannot find how to send that signal back to SWF using the SDK http://docs.aws.amazon.com/sdkforruby/api/Aws/SWF/Client.html
Anyone know where I should look?

From the question, I believe you are somewhat confused about what SWF is / how it works.
Activities don't run and are not retried in isolation. Everything happens in the context of a workflow. The workflow definition tell you when to retry and how to behave if activities fail/timeout etc.
The worker that processes the workflow definition and schedules the next thing that needs to happen is referred to as a decider. (you will see decider and workflow used interchangeably). It's called a decider because based on the current state it makes the decision on what the next activity that needs to be scheduled is. The decider normally takes the workflow history as input when making this input.
In Flow for example, the retry is encoded in the workflow logic. Basically if the activity fails you can just schedule it.
So to finally answer your question: if your target is to only implement the activity workers you don't need to implement any retry logic as that happens at the decider level. You should make sure that the activities are compatible with the decider (you need to make sure the history and the input/output convention are the same).
If your target is to implement your own framework on top of SWF you need to actually do the hard work needed to make the decider work.

In Amazon SWF, can I abuse a Decision task to actually perform the work

I need Amazon SWF to distribute some work, make sure it's done asynchronously, make sure it's store in a reliable way and that it's automatically restarted. However, the workflow logic I need is extremely simple: it's just to get a single task executed.
I implemented it now the way it's supposed to be done:
Request workflow execution
Decider founds out about it and schedules an activity
Workers finds out about the activity request, performs the results and returns the results
Decider notices a result and copies it over in a workflow completion
It seems to me that I can just have the decider do the work – as it were – and complete the workflow execution immediately. That would take care of a lot of code. (The activity might also fail, timeout, etc. All things that I currently need to cater for.)
So back to my question: can I have a decider that performs the work itself and completes the 'workflow' immediately?

Yes. Actually, I think you came up with an interesting use case: using a minimal workflow as a centralized locking mechanism for one-off actions in a distributed system - such as cron jobs executed from a single host in a fleet of many (the hosts have to first undergo election and whichever wins the lock gets to execute an action). The same could be achieved with Amazon SWF and minimum amount of code:
A small Python example, using boto.swf (use 1. from this post to setup the domain):
To code the decider:
#MyDecider.py
import boto.swf.layer2 as swf
class OneShotDecider(swf.Decider):
domain = 'stackoverflow'
task_list = 'default_tasks'
version = '1.0'
def run(self):
history = self.poll()
if 'events' in history:
decisions = swf.Layer1Decisions()
print 'got the decision task, doing the work'
decisions.complete_workflow_execution()
self.complete(decisions=decisions)
return False
return True
To start the decider:
$ ipython -i decider.py
In [1]: while OneShotDecider().run(): print 'polling SWF for decision tasks'
Finally, to start the workflow:
$ ipython
In [1]: wf_type = swf.WorkflowType(domain='stackoverflow', name='MyWorkflow', version='1.0', task_list='default_tasks')
In [2]: wf_type.start()
Out[2]: <WorkflowExecution 'MyWorkflow-1.0' at 0x32e2a10>
Back in the decider window, you you'll see something like:
polling SWF for decision tasks
polling SWF for decision tasks
got the decision task, doing the work
If your workflow is likely to evolve its business logic or grow in the number of activities, it's probably best to stick to the standard way of having Deciders doing the business logic and Workers solving the tasks.

While yes, you can do this (as pointed out by the other answer), there are some things to consider before doing so:
Why are you using SWF to execute this task? Why bother setting it up as a workflow and paying for "StartWorkflow" executions if you can get the same benefit by just invoking your code more directly? If you need to track execution submissions and completions, you can just use an SQS queue for this and get the same results for cheaper.
Your workflows might be extremely simple right now, but they often can and do evolve to be more complex over time. Designing it right from the start can save time in the long run. Do you want future developers working on your code thinking that they should just add more logic to the workflow? Will they know to lookup how to use activities, or just follow the existing pattern you've started with? (Hint - they'll be likely to copy your pattern - developers are lazy :))

Using Timers/Signals to allow human intervention in AWS SWF Workflow

Here's the scenario. A user uploads an Excel file and this kicks off a workflow which validates the file, transforms it into a few different files, then performs an update to a database based on the transforms. After the uploads, the results need to be reviewed by team member before the flow can continue.
I'm using Ruby and have discovered that Signals and Timers are the way to achieve this in SWF. However, the Ruby examples are lacking or non-existent and I need a little help understanding how this would work using Ruby.
Ny understanding so far is that a Timer activity is scheduled which basically pauses the flow until either the timer expires (at which point I could cancel the workflow or email the staff and set another timer) or a signal is sent to the workflow to start the next step. The Decider would handle the signal and then kick off the appropriate activity.
Any thoughts or direction to other sources would be much appreciated.
Thanks,
Thomas

It's somewhat difficult to provide an "answer", given you didn't really ask a specific question. I'm in agreement with you that using a Timer and Signals is what you want.
You don't specify how the team gets notified about the review. I'll assume that you notify them by email and direct them to some website where they can review the changes, and then click on a link to either Approve or Don't Approve. Clicking the link to Approve will send a request to a web server that will "signal" SWF that the review has been approved. Clicking the link to Don't Approve will "signal" SWF that the review has not been approved. You mention that you want to renotify the team (or perhaps escalate to the manager) if no one has taken action on the review. Let's say this renotification happens after 48 hours. After the renotication, you grant them another 72 hours before assumming Don't Approve.
Here's how your workflow looks like to me:
User uploads file and kicks off a workflow
Decider Task schedules "TransformActivity"
TransformActivity runs, transforms the data into different files, and completes successfully
Decider Task schedules "UpdateDatabaseActivity"
UpdateDatabaseActivity runs, updates the database, and completes successfully
Decider Task schedules "EmailTeamActivity"
EmailTeamActivity runs, emails the team, and completes successfully
Decider Task schedules a Timer for 48 hours.
If a signal indicating Approve or Don't Approve is received within 48 hours:
Decider Task schedules the "RecordFinalDecisionActivity"
RecordFinalDecisionActivity will run, record the Approve (or Don't Approve) into the database, and complete successfully.
Decider Task will then close the workflow because it's done.
If no signal is received and the timer fires (after 48 hours):
Decider Task schedules the "EmailTeamAndManagerActivity"
EmailTeamAndManagerActivity runs, emails the team and manager, and completes successfully.
Decider Task schedules another timer for 72 hours.
If a signal indicating Approve or Don't Approve is received within the additional 72 hours given:
Repeat the same logic as the section "If a signal indicating Approve or Don't Approve is received within 48 hours".
If no signal is received and the timer fires (after the additional 72 hours):
At this point, the workflow can assume it was a Don't Approve, schedule the "RecordFinalDecisionActivity" and close the workflow once that activity completes.
The reason why you don't want to have a "review" activity is because that task gets scheduled and then some activity worker needs to reply success. How would that work? When someone clicks the Approve or Don't Approve link, the request to the webserver would have to pull down the activity from the task list. However, if the task list has multiple activities, SWF just gives out any one of them. It might not get the right one. Now, you could argue that you could schedule the different reviews across different task lists, but that's just cumbersome and tedious.
Signals are done to indicate an "external" event, which this very much is. The SWF documentation on Signals does a great job on talking about Signals. Here's the SWF documentation on how to use Timers and Signals. As for the particulars on how to use SWF and Ruby, I can't really help you there. I've only used SWF with Java by using the AWS Flow Framework.

user upload excel file, does "StartWorkflowExecution", that queues a decision task
decision worker notice flow is new / "stage one", it schedules "transform file" activity task
activity worker picks up task, and does the "transform file" activity, when done does "RespondActivityTaskCompleted" with a result of "transformations done", that queues a decision task
decision worker picks up decision task, notices the transformations are done and schedule a new activity task
activity worker picks up activity task, notices it's for a team member (according to the instructions given by the decision worker when scheduling the activity task), team member gets notified, somehow perform his action, then somehow notifies the activity worker which will reply "RespondActivityTaskCompleted"
I don't see the need for a Timer or a Signal, it's just plain flow. Those two concepts are useful if you want recurring events, timeouts, and/or interrupting the flow.
Please note that you can differentiate activity workers by using task lists (for example activity workers for automated work vs activity workers for human participants, whatever).

implementing a timer in a django app

In my Django app, I need to implement this "timer-based" functionality:
User creates some jobs and for each one defines when (in the same unit the timer works, probably seconds) it will take place.
User starts the timer.
User may pause and resume the timer whenever he wants.
A job is executed when its time is due.
This does not fit a typical cron scenario as time of execution is tied to a timer that the user can start, pause and resume.
What is the preferred way of doing this?

This isn't a Django question. It is a system architecture problem. The http is stateless, so there is no notion of times.
My suggestion is to use Message Queues such as RabbitMQ and use Carrot to interface with it. You can put the jobs on the queue, then create a seperate consumer daemon which will process jobs from the queue. The consumer has the logic about when to process.
If that it too complex a system, perhaps look at implementing the timer in JS and having it call a url mapped to a view that processes a unit of work. The JS would be the timer.

Have a look at Pinax, especially the notifications.
Once created they are pushed to the DB (queue), and processed by the cron-jobbed email-sending (2. consumer).
In this senario you won't stop it once it get fired.
That could be managed by som (ajax-)views, that call system process....
edit
instead of cron-jobs you could use a twisted-based consumer:
write jobs to db with time-information to the db
send a request for consuming (or resuming, pausing, ...) to the twisted server via socket
do the rest in twisted

You're going to end up with separate (from the web server) processes to monitor the queue and execute jobs. Consider how you would build that without Django using command-line tools to drive it. Use Django models to access the the database.
When you have that working, layer on on a web-based interface (using full Django) to manipulate the queue and report on job status.
I think that if you approach it this way the problem becomes much easier.

I used the probably simplest (crudest is more appropriate, I'm afraid) approach possible: 1. Wrote a model featuring the current position and the state of the counter (active, paused, etc), 2. A django job that increments the counter if its state is active, 3. An entry to the cron that executes the job every minute.
Thanks everyone for the answers.

You can always use a client based jquery timer, but remember to initialize the timer with a value which is passed from your backend application, also make sure that the end user didn't edit the time (edit by inspecting).
So place a timer start time (initial value of the timer) and timer end time or timer pause time in the backend (DB itself).
Monitor the duration in the backend and trigger the job ( in you case ).
Hope this is clear.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js