AWS SWF Signal during vs Decision after Timer best practice? - amazon-web-services

I have a set of business processes that I think are a good fit for AWS SWF.
Several of these processes include wait periods, that could be anything from a week to 3 months. A (brief and not fully explained) example might be along the lines of "If a user signs up to a particular service, if they are still subscribing after 4 months, send them some form of reward".
I'm looking at modelling this by having the sign up process start off a workflow that then set a timer for the 4 month wait period.
The problem exists with the fact that if the subscriber cancels their subscription within that 4 month period, we don't want to send the reward.
I can see two ways of doing this: Have a "cancel" signal upon cancellation (that would stop the "sleeping" workflow), or having a "check subscription" decision before the "send reward" step (ie, after the workflow "wakes up"). (Obviously I could also do both, for a "belt & braces" approach)
Are there any recommended best practices here? There is the potential for there to be several tens of thousands of these various business processes that could be active or sleeping at any one time.

I would go with both approaches: cancel the workflow through a signal or RequestCancelWorkflowExecution, and also check subscription validity (using a separate activity) before calling the "send reward" activity. Implementing just the latter approach is simpler, but you end up paying for outstanding workflows that are technically cancelled. SWF can certainly handle tens of thousands of open workflows without a problem.
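To make the "belt & braces" idea concrete, here is a minimal sketch of the decider logic as a pure Python function. It does not call SWF. The event and decision type strings mirror SWF's names, but the flattened event dicts (with `activity`, `result` and `signalName` keys), the signal name, and the activity names `CheckSubscription` and `SendReward` are assumptions made for illustration.

```python
from typing import Dict, List

CANCEL_SIGNAL = "subscription-cancelled"  # hypothetical signal name
WAIT_SECONDS = 120 * 24 * 60 * 60         # roughly four months, taken as 120 days

COMPLETE = {"decisionType": "CompleteWorkflowExecution"}


def schedule(activity: str) -> Dict:
    """Build a simplified ScheduleActivityTask decision."""
    return {"decisionType": "ScheduleActivityTask", "activity": activity}


def decide(history: List[Dict]) -> List[Dict]:
    """Return the decisions for one decision task, given a simplified history."""
    types = {e["eventType"] for e in history}

    # Approach 1: a cancel signal (or cancel request) ends the sleeping workflow early.
    cancelled = "WorkflowExecutionCancelRequested" in types or any(
        e["eventType"] == "WorkflowExecutionSignaled"
        and e.get("signalName") == CANCEL_SIGNAL
        for e in history
    )
    if cancelled:
        return [{"decisionType": "CancelWorkflowExecution"}]

    scheduled = {e["activity"] for e in history
                 if e["eventType"] == "ActivityTaskScheduled"}
    completed = {e["activity"]: e.get("result") for e in history
                 if e["eventType"] == "ActivityTaskCompleted"}

    if "SendReward" in completed:
        return [COMPLETE]
    if "CheckSubscription" in completed:
        # Approach 2: re-check the subscription after the timer, before rewarding.
        if completed["CheckSubscription"] != "active":
            return [COMPLETE]
        return [] if "SendReward" in scheduled else [schedule("SendReward")]
    if "TimerFired" in types:
        return [] if "CheckSubscription" in scheduled else [schedule("CheckSubscription")]
    if "TimerStarted" not in types:
        return [{"decisionType": "StartTimer", "startToFireTimeout": str(WAIT_SECONDS)}]
    return []  # timer still running, nothing to do yet
```

A real decider would read these events from the decision task's history and would need to match completed activities to their scheduling events. The sketch only shows where the two checks sit relative to the timer.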

Related

Throttled Queue Service

I have a function doWork(id) that I'm offloading to some worker servers using AWS SQS. This function can get called very frequently, but I'd like to throttle it so that, for a given id, the work is done no more than once per second.
Is it possible with AWS / are there any services that feature this functionality?
EDIT: Some clarification.
doWork(id) does some expensive work on a record in a database. This work needs to be redone whenever the user interacts with the record. Thus, I call doWork(id) whenever the user calls a method that edits the record. However, the user may edit the record many times very quickly (I'm building a text editor, so every character is an edit). Rather than run doWork(id) an unnecessary number of times, I'd like to throttle that work so it happens at most once per second.
Because this work is expensive, I enqueue a message in SQS and have a set of "worker" servers that dequeue tasks and run them.
My goal here is to somehow maintain the stateless horizontal scalability of my servers while throttling doWork(id). To make matters a little more complicated, I don't want to throttle the doWork function itself -- I want to throttle the work for each individual record identified by the id passed to doWork.
You could use a Redis instance on ElastiCache and configure your workers to use a distributed rate limiter for keys based on id. There are also many packages for different languages based on this kind of idea that might be ready to run on your workers.
That's interesting. You want to delay the work in case they hit another key within a given time period. If they don't hit another key in that time period, you then want to do the work. You might also want to do it after x seconds even if they continue typing (Auto Save).
The problem is that each keypress sends a message to the queue. When a worker receives the message, they have no idea whether another key has been pressed since the message was sent, and there's no way to look in the queue for other matching messages.
Amazon SQS does have the ability to delay a message, which means it will not be available for receiving for a given period, but this alone can't solve the problem because the worker doesn't know what else has happened.
Bottom line: A traditional queue is not a suitable mechanism for this use-case. You need something akin to a database/cache that can update a "last modified" timestamp each time that a key is pressed. Once that timestamp is more than x seconds old, you should queue the worker.
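A minimal sketch of that "last modified" idea, assuming a plain dict in place of the shared database/cache and explicit timestamps in place of a clock. The class and method names are hypothetical. Each edit overwrites the timestamp for its id, and a periodic sweep returns only the ids that have been quiet for the chosen interval, which are the ones to queue for the workers.

```python
from typing import Dict, List


class EditDebouncer:
    """Tracks the last edit time per record id (a dict stands in for the cache)."""

    def __init__(self, quiet_seconds: float) -> None:
        self.quiet_seconds = quiet_seconds
        self._last_modified: Dict[str, float] = {}

    def record_edit(self, record_id: str, now: float) -> None:
        # Called on every keypress; only the most recent timestamp is kept.
        self._last_modified[record_id] = now

    def pop_due(self, now: float) -> List[str]:
        # Ids with no edit for at least quiet_seconds are ready to be queued.
        due = sorted(
            rid for rid, ts in self._last_modified.items()
            if now - ts >= self.quiet_seconds
        )
        for rid in due:
            del self._last_modified[rid]
        return due


debouncer = EditDebouncer(quiet_seconds=1.0)
debouncer.record_edit("doc-1", now=0.0)
debouncer.record_edit("doc-1", now=0.5)   # user is still typing
print(debouncer.pop_due(now=1.0))          # nothing is due yet
print(debouncer.pop_due(now=1.5))          # doc-1 has been quiet for one second
```

With several servers, the dict would have to live in a shared store and the sweep would need to be atomic so that two sweepers do not queue the same id. The sketch leaves that out.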

What are the possible use cases for Amazon SQS or any Queue Service?

So I have been trying to get hands-on with Amazon's AWS, since my company's whole infrastructure is based on it.
One component I have never been able to understand properly is the Queue Service. I have searched Google quite a bit but haven't found a satisfactory answer. I think a cron job and a queue service are somewhat similar; correct me if I am wrong.
So what exactly does SQS do? As far as I understand, it stores simple messages to be used by other components in AWS to do tasks, and you can send messages to do that.
In this question, Can someone explain to me what Amazon Web Services components are used in a normal web service?; the answer mentioned they used SQS to queue tasks they want performed asynchronously. Why not just give a message back to the user & do the processing later on? Why wait for SQS to do its stuff?
Also, let's say I have a web app which allows users to schedule some daily tasks. How would SQS fit into that?
No, cron and SQS are not similar. One (cron) schedules jobs while the other (SQS) stores messages. Queues are used to decouple message producers from message consumers. This is one way to architect for scale and reliability.
Let's say you've built a mobile voting app for a popular TV show and 5 to 25 million viewers are all voting at the same time (at the end of each performance). How are you going to handle that many votes in such a short space of time (say, 15 seconds)? You could build a significant web server tier and database back-end that could handle millions of messages per second but that would be expensive, you'd have to pre-provision for maximum expected workload, and it would not be resilient (for example to database failure or throttling). If few people voted then you're overpaying for infrastructure; if voting went crazy then votes could be lost.
A better solution would use some queuing mechanism that decoupled the voting apps from your service where the vote queue was highly scalable so it could happily absorb 10 messages/sec or 10 million messages/sec. Then you would have an application tier pulling messages from that queue as fast as possible to tally the votes.
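As a small illustration of that decoupling, the sketch below uses Python's in-process `queue.Queue` as a stand-in for the vote queue. It is not SQS, and the function names are hypothetical. The producer only enqueues, and a separate worker thread tallies at its own pace.

```python
import queue
import threading
from collections import Counter
from typing import Iterable


def tally_votes(vote_queue: "queue.Queue", tally: Counter) -> None:
    """Consumer: pull votes until the None sentinel arrives."""
    while True:
        vote = vote_queue.get()
        if vote is None:
            break
        tally[vote] += 1


def run_demo(votes: Iterable[str]) -> Counter:
    vote_queue: "queue.Queue" = queue.Queue()
    tally: Counter = Counter()
    worker = threading.Thread(target=tally_votes, args=(vote_queue, tally))
    worker.start()
    for vote in votes:       # producer side: enqueue and move on
        vote_queue.put(vote)
    vote_queue.put(None)     # tell the consumer there is nothing more to read
    worker.join()
    return tally


print(run_demo(["alice", "bob", "alice"]))
```

The producer never waits on the tallying logic. If the consumer is slow, votes accumulate in the queue instead of being dropped.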
One thing I would add to @jarmod's excellent and succinct answer is that the size of the messages does matter. For example, in AWS the maximum size is just 256 KB unless you use the Extended Client Library, which increases the maximum to 2 GB. But note that it uses S3 as temporary storage.
In RabbitMQ the practical limit is around 100 KB. There is no hard-coded limit in RabbitMQ, but the system simply stalls more or less often. From personal experience, RabbitMQ can handle a steady stream of around 1 MB messages for about 1 - 2 hours non-stop, but then it will start to behave erratically, often becoming a zombie and you'll need to restart the process.
SQS is a great way to decouple services, especially when there is a lot of heavy-duty, batch-oriented processing required.
For example, let's say you have a service where people upload photos from their mobile devices. Once the photos are uploaded your service needs to do a bunch of processing of the photos, e.g. scaling them to different sizes, applying different filters, extracting metadata, etc.
One way to accomplish this would be to post a message to an SQS queue (or perhaps multiple messages to multiple queues, depending on how you architect it). The message(s) describe work that needs to be performed on the newly uploaded image file. Once the message has been written to SQS, your application can return a success to the user because you know that you have the image file and you have scheduled the processing.
In the background, you can have servers reading messages from SQS and performing the work specified in the messages. If one of those servers dies another one will pick up the message and perform the work. SQS guarantees that a message will be delivered eventually so you can be confident that the work will eventually get done.

When to use delay queue feature of Amazon SQS?

I understand the concept of delay queue of Amazon SQS, but I wonder why it is useful.
What's the usage of SQS delay queue?
Thanks
One use case I can think of is in distributed applications that have eventual consistency semantics. The system consuming the message may depend on something like a correlation identifier being available, and hence may need to wait for a certain guaranteed duration before it can see the correlation data. In this case, it makes sense for the message to be delayed for that duration.
Like you I was confused as to a use-case for delay queues, until I stumbled across one in my own work. My application needs to have an internal queue with each item waiting at least one minute between each check for completion.
So instead of having to manage a "last-checked-time" on every object, I just shove the object's ID into an SQS queue message with a delay time of 60 seconds, and my main loop then becomes a simple long-poll against the queue.
A few off the top of my head:
Emails - Let's say you have a service that sends reminder emails triggered from queue messages. You'd have to delay enqueueing the message in that case.
Race conditions - Delivery delays can be used to overcome race conditions in distributed systems. For example, a service could insert a row into a table and send a message about its availability to other services. They can't use the new entry just yet, so you have to delay publishing the SQS message.
Handling retries - Sometimes if a message fails you want to retry with exponential backoffs. This requires re-enqueuing the message with longer delays.
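For the retry case, a minimal sketch of how the delay for each re-enqueue might be computed. The function name and the 5-second base are assumptions. The cap reflects the 15-minute (900-second) maximum that SQS allows for a message delay.

```python
SQS_MAX_DELAY_SECONDS = 900  # SQS allows at most a 15-minute delay per message


def backoff_delay(attempt: int, base_seconds: int = 5) -> int:
    """Delay for the given retry attempt (1-based), doubling each time up to the cap."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(base_seconds * 2 ** (attempt - 1), SQS_MAX_DELAY_SECONDS)


for attempt in range(1, 6):
    print(attempt, backoff_delay(attempt))
```

The result would be passed as the delay when the failed message is sent again. Retries that need to wait longer than the cap require some other mechanism.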
I've built a suite of API's to make queue message scheduling easy. You can call our API's to schedule queue messages, cancel, edit, and check on the status of such messages. Think of it like a scheduler microservice.
www.schedulerapi.com
If you are looking for a solution, let me know. I've built these schedulers before at work for delivering emails at high scale, so I have experience with similar use cases.
One use-case can be:
Think of a time-critical operation like a scheduled equity trade order.
Suppose one of your systems fetches all the orders scheduled for the next 60 minutes and puts them in a queue (from which another subsystem will fetch them).
If you send these orders directly, they will be visible in the queue immediately and will be processed in whatever order they are received.
Most likely, they will not execute at the exact time (Hour:Minute:Seconds) the customer wanted, and this will impact the outcome.
To solve this, the first subsystem adds delay seconds (the difference between the current time and the execution time), so that the message only becomes visible after that delay, that is, at the time the user wanted.
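A minimal sketch of that calculation, with a hypothetical function name. One caveat the sketch makes explicit: SQS caps a message delay at 900 seconds (15 minutes), so an order further away than that cannot be delayed all the way in one step. Here the value is clamped, and the consumer would have to re-check the execution time on receipt.

```python
from datetime import datetime, timedelta, timezone

SQS_MAX_DELAY_SECONDS = 900  # SQS allows at most a 15-minute delay per message


def delay_seconds_until(execute_at: datetime, now: datetime) -> int:
    """Seconds to delay a message so it becomes visible at execute_at, clamped to 0..900."""
    remaining = int((execute_at - now).total_seconds())
    return max(0, min(remaining, SQS_MAX_DELAY_SECONDS))


now = datetime(2024, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
print(delay_seconds_until(now + timedelta(minutes=5), now))   # within the cap
print(delay_seconds_until(now + timedelta(minutes=45), now))  # clamped to the cap
```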

Using Timers/Signals to allow human intervention in AWS SWF Workflow

Here's the scenario. A user uploads an Excel file and this kicks off a workflow which validates the file, transforms it into a few different files, then performs an update to a database based on the transforms. After the uploads, the results need to be reviewed by team member before the flow can continue.
I'm using Ruby and have discovered that Signals and Timers are the way to achieve this in SWF. However, the Ruby examples are lacking or non-existent and I need a little help understanding how this would work using Ruby.
My understanding so far is that a Timer is scheduled, which basically pauses the flow until either the timer expires (at which point I could cancel the workflow, or email the staff and set another timer) or a signal is sent to the workflow to start the next step. The Decider would handle the signal and then kick off the appropriate activity.
Any thoughts or direction to other sources would be much appreciated.
Thanks,
Thomas
It's somewhat difficult to provide an "answer", given you didn't really ask a specific question. I'm in agreement with you that using a Timer and Signals is what you want.
You don't specify how the team gets notified about the review. I'll assume that you notify them by email and direct them to some website where they can review the changes, and then click on a link to either Approve or Don't Approve. Clicking the link to Approve will send a request to a web server that will "signal" SWF that the review has been approved. Clicking the link to Don't Approve will "signal" SWF that the review has not been approved. You mention that you want to renotify the team (or perhaps escalate to the manager) if no one has taken action on the review. Let's say this renotification happens after 48 hours. After the renotification, you grant them another 72 hours before assuming Don't Approve.
Here's what your workflow looks like to me:
User uploads file and kicks off a workflow
Decider Task schedules "TransformActivity"
TransformActivity runs, transforms the data into different files, and completes successfully
Decider Task schedules "UpdateDatabaseActivity"
UpdateDatabaseActivity runs, updates the database, and completes successfully
Decider Task schedules "EmailTeamActivity"
EmailTeamActivity runs, emails the team, and completes successfully
Decider Task schedules a Timer for 48 hours.
If a signal indicating Approve or Don't Approve is received within 48 hours:
Decider Task schedules the "RecordFinalDecisionActivity"
RecordFinalDecisionActivity will run, record the Approve (or Don't Approve) into the database, and complete successfully.
Decider Task will then close the workflow because it's done.
If no signal is received and the timer fires (after 48 hours):
Decider Task schedules the "EmailTeamAndManagerActivity"
EmailTeamAndManagerActivity runs, emails the team and manager, and completes successfully.
Decider Task schedules another timer for 72 hours.
If a signal indicating Approve or Don't Approve is received within the additional 72 hours given:
Repeat the same logic as the section "If a signal indicating Approve or Don't Approve is received within 48 hours".
If no signal is received and the timer fires (after the additional 72 hours):
At this point, the workflow can assume it was a Don't Approve, schedule the "RecordFinalDecisionActivity" and close the workflow once that activity completes.
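The review part of the steps above (from the 48-hour timer onward) can be sketched as a small state machine. This is plain Python, not the SWF API or the Flow Framework. The state names, event strings, and action strings are all hypothetical and only mirror the activity names used in the steps.

```python
from typing import List, Tuple


def advance(state: str, event: str) -> Tuple[str, List[str]]:
    """Given the current review state and an incoming event, return (new_state, actions)."""
    if state == "DECIDED":
        return state, []  # late signals or timers are ignored once a decision is recorded

    if event in ("signal:Approve", "signal:DontApprove"):
        decision = event.split(":", 1)[1]
        return "DECIDED", [f"schedule RecordFinalDecisionActivity({decision})"]

    if event == "timer_fired":
        if state == "WAITING_48H":
            # First timeout: escalate and grant another 72 hours.
            return "WAITING_72H", ["schedule EmailTeamAndManagerActivity",
                                   "start timer 72h"]
        if state == "WAITING_72H":
            # Second timeout: treat silence as Don't Approve.
            return "DECIDED", ["schedule RecordFinalDecisionActivity(DontApprove)"]

    return state, []


state = "WAITING_48H"
for event in ["timer_fired", "signal:Approve"]:
    state, actions = advance(state, event)
    print(state, actions)
```

In a real decider the state would be derived from the workflow history on every decision task. It is kept as an explicit value here only to keep the example short.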
The reason why you don't want to have a "review" activity is because that task gets scheduled and then some activity worker needs to reply success. How would that work? When someone clicks the Approve or Don't Approve link, the request to the webserver would have to pull down the activity from the task list. However, if the task list has multiple activities, SWF just gives out any one of them. It might not get the right one. Now, you could argue that you could schedule the different reviews across different task lists, but that's just cumbersome and tedious.
Signals are meant to indicate an "external" event, which this very much is. The SWF documentation on Signals does a great job of explaining them, and there is also SWF documentation on how to use Timers and Signals together. As for the particulars of using SWF with Ruby, I can't really help you there. I've only used SWF with Java, through the AWS Flow Framework.
The user uploads the Excel file and the application calls "StartWorkflowExecution", which queues a decision task.
The decision worker notices the flow is new ("stage one") and schedules a "transform file" activity task.
An activity worker picks up the task and performs the "transform file" activity. When done, it calls "RespondActivityTaskCompleted" with a result of "transformations done", which queues a decision task.
The decision worker picks up the decision task, notices the transformations are done, and schedules a new activity task.
An activity worker picks up the activity task and notices it's for a team member (according to the instructions given by the decision worker when scheduling the activity task). The team member gets notified, somehow performs their action, then somehow notifies the activity worker, which replies with "RespondActivityTaskCompleted".
I don't see the need for a Timer or a Signal, it's just plain flow. Those two concepts are useful if you want recurring events, timeouts, and/or interrupting the flow.
Please note that you can differentiate activity workers by using task lists (for example activity workers for automated work vs activity workers for human participants, whatever).

SQS/task-queue job retry count strategy?

I'm implementing a task queue with Amazon SQS (but I guess the question applies to any task queue), where the workers are expected to take different actions depending on how many times the job has already been retried (move it to a different queue, increase the visibility timeout, send an alert, etc.).
What would be the best way to keep track of the failed job count? I'd like to avoid having to keep a centralized db for job:retry-count records. Should I instead look at time spent in the queue from a monitoring process? IMO that would be ugly or unclean at best, iterating over jobs until I find ancient ones.
thanks!
Andras
There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
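A minimal sketch of retry logic driven by that attribute. It assumes the message is a dict shaped like the ones the AWS SDKs return, with the count as a string under `Attributes` (this is how boto3 returns it when the attribute is requested in the receive call). The thresholds and the action names are hypothetical.

```python
from typing import Dict


def retry_action(message: Dict, max_receives: int = 5) -> str:
    """Pick an action based on how many times SQS has delivered this message."""
    receive_count = int(message["Attributes"]["ApproximateReceiveCount"])
    if receive_count == 1:
        return "process"
    if receive_count <= max_receives:
        # Retried before: give the worker more time before the message reappears.
        return "process_with_longer_visibility_timeout"
    # Too many attempts: stop retrying and surface the failure.
    return "move_to_failed_queue_and_alert"


first_try = {"Body": "job-42", "Attributes": {"ApproximateReceiveCount": "1"}}
sixth_try = {"Body": "job-42", "Attributes": {"ApproximateReceiveCount": "6"}}
print(retry_action(first_try))
print(retry_action(sixth_try))
```

As the attribute name says, the count is approximate, so hard limits based on it are best treated as "around N attempts" and not as exact.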
I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in SimpleDB and a task in SQS. You can put any information you like in SimpleDB, such as the job creation time. When a worker pulls a job from the queue, it can grab the corresponding record from SimpleDB to determine its history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.
Amazon just released Simple Workflow Service (SWF), which you can think of as a more sophisticated/flexible version of GAE Task Queues.
It lets you monitor your tasks (with heartbeats), configure retry strategies, and create complicated workflows. It looks pretty promising for abstracting out task dependencies, scheduling, and fault tolerance for tasks (especially asynchronous ones).
Check out http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for an overview.
SQS stands for "Simple Queue Service" which, in concept is the incorrect name for that service. The first and foremost feature of a "Queue" is FIFO (First in, First out), and SQS lacks that. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud Queue service, use Azure's Service Bus since it's a TRUE Queue concept.