Creating a scalable and fault tolerant system using AWS ECS

Creating a scalable and fault tolerant system using AWS ECS - concurrency

We're designing C# scheduled task (runs every few hours) that will run on AWS ECS instances that will grab batched transaction data for thousands of customers from an endpoint, modify the data then send it on to another web service. We will be maintaining the state of the last successful batch in a separate database (using some like created date of the transactions). We need the system to be scalable so as more customers are added we add additional ECS containers to process the data.
There are the options we're considering:
Each container only processes a specific subset of the data. As more customers are added more contains are added. We would need to maintain a logical separation of what contains are processing what customers data.
All the containers process all of the customers. We use some kind of locking flags on the database to let other processes know that the customers data is being processed.
Some other approach.
I think that option 2 is probably the best, but it adds a lot of complexity regarding the locking and unlocking of customers. Are there specific design patterns I could be pointed towards if that if the correct solution?

In both scenarios an important thing to consider is retries in case processing for a specific customer fails. One potential way to distribute jobs across a vast number of container with retries would be to use AWS SQS.
A single container would run periodically every few hours and be the job generator. It would create one SQS queued item for each customer that needs to be processed. In response to items appearing in the queue a number of "worker" containers would be spun up by ECS to consume items from the queue. This can be made to autoscale relative to the number of items in the queue to quickly spin up many containers that can work in parallel.
Each container would use its own high performance concurrent poller similar to this (https://www.npmjs.com/package/squiss) to start grabbing items from the queue and processing them. If a worker failed or crashed due to a bug then SQS will automatically redeliver and dropped queued items that worker had been working on to a different worker after they time out.
This approach would give you a great deal of flexibility, and would let you horizontally scale out the number of workers, while letting any of the workers process any jobs from the queue that it grabs. It would also ensure that every queued item gets processed at least once, and that none get dropped forever in case something crashes or goes wrong.

Related

Running large jobs with low startup time and autoscaling for bursts of traffic

For a website I’m developing on AWS, a user can submit a large job (ex. select a large number of items and ask to update them all in some way). We don’t want to limit the size of the job these users are submitting so this job can can in theory run for a very long period of time and require a large amount of memory (this rules out AWS Lambda as a compute engine option). We want jobs to be as independent from one another as possible so we chose to run each job in its own container in Amazon ECS. What we currently do when a user submits a job request is send a message with a job id/reference to an SQS queue, have AWS lambda poll that queue and upon receiving a message, lambda starts an ECS task (SQS -> Lambda -> ECS). This has the problem that a new ECS task is started with each request, so a new container must be booted up which can take minutes. This latency is directly visible to the user and is particularly unacceptable if the users job is not even particularly large yet they still wait for minutes for the container to boot up. Additionally, the cost of constantly running container or two would not be too problematic.
I've been toying with some ideas for updating this flow.
Attempt 1:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
message = pollForMessages();
}
processMessage(message);
// task finishes, and container can be brought down
We remove the lambda from the flow and just have SQS -> ECS rather than SQS -> Lambda -> ECS. In this case, there would be no cold start assuming a container is up spinning for messages. We could set the minimum number of tasks we want running to be a number > 0 to ensure all messages are processed at some point. However this suffers from the problem that it would not auto-scale as the number of messages in the queue increases. So something needs to spawn more containers when traffic increases.
Attempt 2:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
message = pollForMessages();
}
If (number of running tasks < number of messages in queue) {
spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
This comes with the issue that we could end up over provisioning containers if multiple containers see that there are more messages in the queue than tasks running. Since these tasks run forever until a message is processed this could result in a large unnecessary cost. It could also under provision containers - if the task sees that number of running tasks >= number of messages, but these running tasks are already busy processing messages, these tasks will not end up taking one of these messages out of the queue and we may end up with messages that have to wait a very long time to be processed.
Attempt 3:
message = null;
while (message == null) {
message = pollForMessages();
If (# of containers > min provisioned && this particular container has been running longer than some timeout) {
// finish this task so this container can be brought down
return;
}
}
If (number of running tasks < number of messages in queue) {
spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
While this may save us some cost compared to Attempt 2 so over provisioning wouldn’t be so much of an issue, there is still the possibility that we could under provision containers, in which case certain job requests would need to wait for potentially long periods of time before being processed.
Attempt 4:
We can introduce locking (ex. https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/) to mitigate some of the race conditions, however we'll always have the issue that a task running does not necessarily mean a task that is available to pick up messages and Fargate gives us no way of distinguishing between these, which makes it difficult to determine how many containers to provision (ex. we see there are 5 running containers and 5 messages, but we don't know whether to provision more containers or not because we don't know if those containers are already processing a message or if they're waiting). Alternatively we could introduce some mechanism, either an external orchestrator or some logic within the containers and some data store, to manage the state of these containers.
Essentially to deal with each of these problems, the architecture becomes more and more complex and implementation would be difficult and error prone.
It also seems to me like these solutions are reinventing the wheel, and I feel there must be some service out there that has solved this problem already, but I can’t seem to find it.
The suggestions I’ve seen to deal with this are:
Maybe AWS batch is more suited for this use case - Indeed, AWS batch might be the more recommended approach for a workload like this but, we don’t remove any of the cold start problem by switching. AWS batch would still create a new container with each job.
Run the ECS tasks on EC2 rather than Fargate, then cache the container image on the host - With this, we’d be managing our own infrastructure and ideally we’d like this to be serverless.
Have an alarm on the number of messages in the queue and have this alarm trigger a lambda that then boots up more containers - alarms on the /AWS log group have a minimum period of 1 minute. This means the alarm would not be triggered until a minute after we’d received more requests than our provisioned containers can handle. Additionally we'd have to set up many alarms to scale at different numbers of messages.
I’m wondering if anyone is aware of potential services/frameworks that could make doing this more feasible? Or if anyone has suggestions on alternative architectures?

If you don't mind a bit slower response time to the bursts, you may create an autoscaling group (I assume there is something similar for ECS). This group can be governed by a custom metric, e. g. queue length divided by the number of workers. A detailed guide is here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html
In any case, I'd decouple the scaling decision from the worker code, because there is a varying number of workers that you would need to synchronize. It's much easier to have one overseer that controls how many workers there should be. Because the overseer is not on the critical path to task processing, you don't need to care that much about its uptime. It's OK if it takes a few minutes before it recovers after a failure - the workers are still there, processing at least at some capacity.

What are the possible use cases for Amazon SQS or any Queue Service?

So I have been trying to get my hands on Amazon's AWS since my company's whole infrastructure is based of it.
One component I have never been able to understand properly is the Queue Service, I have searched Google quite a bit but I haven't been able to get a satisfactory answer. I think a Cron job and Queue Service are quite similar somewhat, correct me if I am wrong.
So what exactly SQS does? As far as I understand, it stores simple messages to be used by other components in AWS to do tasks & you can send messages to do that.
In this question, Can someone explain to me what Amazon Web Services components are used in a normal web service?; the answer mentioned they used SQS to queue tasks they want performed asynchronously. Why not just give a message back to the user & do the processing later on? Why wait for SQS to do its stuff?
Also, let's just say I have a web app which allows user to schedule some daily tasks, how would SQS would fit in that?

No, cron and SQS are not similar. One (cron) schedules jobs while the other (SQS) stores messages. Queues are used to decouple message producers from message consumers. This is one way to architect for scale and reliability.
Let's say you've built a mobile voting app for a popular TV show and 5 to 25 million viewers are all voting at the same time (at the end of each performance). How are you going to handle that many votes in such a short space of time (say, 15 seconds)? You could build a significant web server tier and database back-end that could handle millions of messages per second but that would be expensive, you'd have to pre-provision for maximum expected workload, and it would not be resilient (for example to database failure or throttling). If few people voted then you're overpaying for infrastructure; if voting went crazy then votes could be lost.
A better solution would use some queuing mechanism that decoupled the voting apps from your service where the vote queue was highly scalable so it could happily absorb 10 messages/sec or 10 million messages/sec. Then you would have an application tier pulling messages from that queue as fast as possible to tally the votes.

One thing I would add to #jarmod's excellent and succinct answer is that the size of the messages does matter. For example in AWS, the maximum size is just 256 KB unless you use the Extended Client Library, which increases the max to 2 GB. But note that it uses S3 as a temporary storage.
In RabbitMQ the practical limit is around 100 KB. There is no hard-coded limit in RabbitMQ, but the system simply stalls more or less often. From personal experience, RabbitMQ can handle a steady stream of around 1 MB messages for about 1 - 2 hours non-stop, but then it will start to behave erratically, often becoming a zombie and you'll need to restart the process.

SQS is a great way to decouple services, especially when there is a lot of heavy-duty, batch-oriented processing required.
For example, let's say you have a service where people upload photos from their mobile devices. Once the photos are uploaded your service needs to do a bunch of processing of the photos, e.g. scaling them to different sizes, applying different filters, extracting metadata, etc.
One way to accomplish this would be to post a message to an SQS queue (or perhaps multiple messages to multiple queues, depending on how you architect it). The message(s) describe work that needs to be performed on the newly uploaded image file. Once the message has been written to SQS, your application can return a success to the user because you know that you have the image file and you have scheduled the processing.
In the background, you can have servers reading messages from SQS and performing the work specified in the messages. If one of those servers dies another one will pick up the message and perform the work. SQS guarantees that a message will be delivered eventually so you can be confident that the work will eventually get done.

Distributed Priority Queue, once and only once

TL;DR
I have producers, tasks and consumers. I need a scalable queuing system which can ensure that a task can be consumed once and only once, and which can sort the tasks according to their priority.
The context:
We have a prototype working, but it's not "scale ready", and today we need to scale...
Below is the prototype "process":
1°) Some customers upload dataset in the database (PostgreSQL)
2°) Each second, an application fetches for new dataset in the database and converts them into tasks.
One customer's dataset can generate thousand of tasks (~500K tasks/day, ~30K tasks/customer)
3°) An application "Dispatcher"
fetches sorted tasks from the database (tasks with the smallest dataset will be proceeded first even if they have been submitted later + some random value to shuffle)
performs some validations (check if the task has been canceled or not)
dispatch the task to the according worker.
Each worker can process only one kind of task, but it can process thousands of them concurrently.
4°) The workers receive the task, and push the result to the database
5°) A "Monitor" application checks the state of all tasks, and retries any task that needs to (worker crashed).
Today, the bottleneck is the SQL server, I can tune it but I would prefer to redesign it the right way. So, I was wondering if there are some best practices for that kind of process?
It seems I need a distributed queuing system (Kafka?), which can guarantee that a task will be processed once and only once, but which will also manage priority.

Quartz Job throttling

Little bit wiered requirement:
I have a few quartz jobs that are acting as data collectors, collects data from different locations as and when available. Then I have another job [data load] which is being called/triggered from the collector jobs to update my DB.
My requirement is to some how throttle Load Job to have only two instances running in parallel and handle the work coming from the collector jobs
Collector Jobs 1,2,...N > Loader Job (two instances)
Job programs are deployed in clusted Tomcat.
Two Questions:
1) How can I make the Collector jobs to wait, when two instances of the Loader job already in process? Is there any way to use the quartz program to impelement FIFO logic to throttle the work to Loader job? I also do not want the collector to pick up another data, if one is already waiting to be processed.
2) Is there any way to run a job with two threads only? No more than two instances should be active at a time? I have limitation on my DB table to run only two instances in parallel.

It's 8 years later and the question shows up as the top result in google when searching for job throttling. And while the case from question clearly begs for using a queue, the actual answer was never given.
So... To throttle jobs in quartz one has to use TriggerListener and implement the throttling in vetoJobExecution. The job itself can also be annotated to prevent concurrent executions with #DisallowConcurrentExecution.

It seems you have a producer-consumer situation here.
The producer and the consumer are usually separated by a queue. Have your collectors put items into a (persistent?) queue and have your Loader reading from the queue and dispatching up to 2 handling threads.

SQS/task-queue job retry count strategy?

I'm implementing a task queue with Amazon SQS ( but i guess the question applies to any task-queue ) , where the workers are expected to take different action depending on how many times the job has been re-tried already ( move it to a different queue, increase visibility timeout, send an alert..etc )
What would be the best way to keep track of failed job count? I'd like to avoid having to keep a centralized db for job:retry-count records. Should i look at time spent in the queue instead in a monitoring process? IMO that would be ugly or un-clean at best, iterating over jobs until i find ancient ones..
thanks!
Andras

There is another simpler way. With your message you can request ApproximateReceiveCount information and base your retry logic on that. This way you won't have to keep it in the database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html

I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in simpleDB and a task in SQS. You can put any information you like in SimpleDB like the job creation time. When a worker pulls a job from the queue it can grab the corresponding record from simpleDB to determine it's history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack-trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.

Amazon just released Simple workflow serice (swf) which you can think of as a more sophisticated/flexible version of GAE Task queues.
It will let you monitor your tasks (with hearbeats), configure retry strategies and create complicated workflows. It looks pretty promising abstracting out task dependencies, scheduling and fault tolerance for tasks (esp. asynchronous ones)
Checkout http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for overview.

SQS stands for "Simple Queue Service" which, in concept is the incorrect name for that service. The first and foremost feature of a "Queue" is FIFO (First in, First out), and SQS lacks that. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud Queue service, use Azure's Service Bus since it's a TRUE Queue concept.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js