How to ensure once-only processing of data in an AWS serverless architecture? - amazon-web-services

I have some data that needs to be processed at a point in time.
My current strategy is to pull the data every minute and load it into a queue and process it.
I have two concerns with this strategy:
1) I can't guarantee that the last minute captures all data, so I pull the last two minutes; and
2) Lambdas, as far as I know, can fire multiple times depending on the trigger (in this case SQS).
I'm trying to avoid writing a flag to the data because of the spiky nature of batch processing.
The only other solution I can think of is using S3 to create a lock-file.
Is there a better way to 'kick off' future events? Is there a strategy outside database and S3 flags?

Have a look at SQS FIFO queues: they are designed to deliver each message once and only once.
"You can now use Amazon Simple Queue Service (SQS) for applications that require messages to be processed in a strict sequence and exactly once using First-in, First-out (FIFO) queues. FIFO queues are designed to ensure that the order in which messages are sent and received is strictly preserved and that each message is processed exactly once. ..." (source)
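As a rough sketch of how this could apply to the overlapping two-minute pulls in the question: if each record is sent with a MessageDeduplicationId derived from its content, SQS drops repeats of the same ID that arrive within the 5-minute deduplication interval. The helper name, the default group ID, and the idea of hashing the JSON body are assumptions made for illustration, not something the answer prescribes.

```python
import hashlib
import json


def build_fifo_send_params(queue_url, record, group_id="default"):
    """Build keyword arguments for an SQS FIFO SendMessage call.

    Hypothetical helper: the deduplication ID is a SHA-256 hash of the
    record's canonical JSON, so the same record pulled twice maps to the
    same ID.
    """
    # sort_keys gives a stable serialisation regardless of dict key order.
    body = json.dumps(record, sort_keys=True)
    dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": group_id,
        "MessageDeduplicationId": dedup_id,
    }
```

With boto3 the result would be passed as `boto3.client("sqs").send_message(**params)`. Note that this only suppresses duplicate sends; a consumer can still see a message again if it fails to delete it before the visibility timeout expires.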

Related

Throughput in Standard SQS vs FIFO SQS with a unique groupId for every message

I do not care much about the order of events but I would like the message to be processed exactly once. The lambda listening to SQS messages will store it in DynamoDB so throughput is pretty important as I have multiple microservices (as producers) writing messages to this SQS that will be read by a single microservice.
As for processing messages exactly once, that is something a FIFO queue supports, but it is said to have poor throughput.
Is the throughput of the FIFO queue the same as the Standard queue if each message has a unique groupId?
If not, my next option is probably to use "attribute_not_exists" in DynamoDB while storing the message.
Which of these should work better?
Messages / sec
FIFO:
    30,000 messages (with batching + high throughput mode)
    3,000 messages (without batching + high throughput mode)
    3,000 messages (with batching)
    300 messages (without batching)
Standard:
    Nearly unlimited
Source: https://aws.amazon.com/sqs/faqs/
To process each message exactly once, you need to use a FIFO queue with a deduplication ID.
If your throughput requirement is below the limits listed above, a FIFO queue is fine.
If not, using DynamoDB as in your original plan is an alternative. With that approach, though, you have to manage a lot yourself, such as deleting the message and tracking messages that have been read but not yet fully processed.
FIFO SQS queues have different rate limits than standard SQS queues, regardless of the use of message group IDs.
SQS Standard queues support a nearly unlimited number of API calls per second, per API action (SendMessage, ReceiveMessage, or DeleteMessage).
FIFO SQS supports 300 TPS for each API method
Look at the quota docs here
Also, AWS has a newer high throughput mode for FIFO queues, which might interest you.
With batching (a maximum of 10 messages per API call), a FIFO queue can handle 3,000 messages per second: 300 calls per second x 10 messages.
Regarding making sure you don't handle the same message twice: have you had a look at the FIFO deduplication ID? I am not sure it is exactly what you need, but it sounds pretty similar to your requirement.
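To make the batching arithmetic concrete, here is a minimal sketch that splits message bodies into groups of at most 10, shaped like the Entries argument of SendMessageBatch. The function name is made up, and it assumes the FIFO queue has content-based deduplication enabled (otherwise each entry would also need a MessageDeduplicationId).

```python
def to_batches(bodies, group_id, batch_size=10):
    """Yield lists of entries suitable for SQS SendMessageBatch.

    Hypothetical helper. SQS accepts at most 10 entries per batch call,
    so larger sizes are rejected up front.
    """
    if not 1 <= batch_size <= 10:
        raise ValueError("batch_size must be between 1 and 10")
    for start in range(0, len(bodies), batch_size):
        chunk = bodies[start:start + batch_size]
        yield [
            {
                # Id only has to be unique within one batch request.
                "Id": str(start + offset),
                "MessageBody": body,
                "MessageGroupId": group_id,
            }
            for offset, body in enumerate(chunk)
        ]
```

Each yielded list would go to `sqs.send_message_batch(QueueUrl=..., Entries=entries)`, so 25 messages cost 3 API calls instead of 25.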
SQS delivery guarantee is at least once. Your application must be designed to handle processing duplicate messages.
I'd strongly recommend building your application this way.
If you must process some type of data exactly once, you need a strongly consistent system. Consider using DynamoDB with conditional updates.
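A minimal sketch of the conditional-write idea, assuming a DynamoDB table whose partition key is a string attribute named message_id (the table layout and function names are assumptions). The `dynamodb` argument is expected to behave like `boto3.client("dynamodb")`; the write succeeds only for the first caller, so duplicates can be skipped.

```python
def claim_message(dynamodb, table_name, message_id):
    """Return True if this call is the first to record message_id."""
    try:
        dynamodb.put_item(
            TableName=table_name,
            Item={"message_id": {"S": message_id}},
            # The write fails if an item with this key already exists.
            ConditionExpression="attribute_not_exists(message_id)",
        )
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False
    return True


def handle_once(dynamodb, table_name, message_id, process):
    """Run process() only if message_id has not been claimed before."""
    if not claim_message(dynamodb, table_name, message_id):
        return False  # duplicate delivery, skip it
    process()
    return True
```

One design caveat: if `process()` fails after the claim is written, the message is marked as seen but was never handled, so a real implementation would need to release or expire the claim.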

I want to know when a batch of messages has completed in an AWS SQS queue

I think this is more of an 'architecture design' question.
I have a producer Lambda that will put ~600 messages on an SQS queue as a batch (that is, ~600 separate messages, not 1 message with a body of ~600 items); there are multiple producers. A consumer Lambda will take individual messages and deal with them (at scale). What I want to do is run another Lambda when each batch is complete.
My initial idea was to create a 'unique batch number', a 'total batch number' and a 'batch position number', and add them to the message attributes of every message. The consumer Lambda would then check these to decide whether the batch is complete.
But does that mean I would need to use a FIFO queue, partition on the batch number, and have only one Lambda consumer per batch? Or do I run some sort of state management in DynamoDB? (Is there a pattern out there for this? Please guide me.)
Regards, J
It seems like the goal is to achieve Fork-Join capabilities in a distributed system. One way to handle this in AWS is using Step Functions. Assuming a queue service needs to be used, state of the overall operation will need to be tracked. Some ways to do this are:
1) Store the state of the overall operation in a DB.
2) Put a 'termination' message in the queue after all others and process the queue in FIFO order.
3) Create a metadata service which receives 'start' and 'stop' messages for each service and handles them accordingly.
Reference: Fork and Join with Amazon Lambda
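The "store state in a DB" option can be sketched with the batch number, position, and total from the question. The class below is an in-memory stand-in used only to show the logic; the name BatchTracker is made up, and a real version shared by many Lambda invocations would need atomic updates in a store such as DynamoDB, which is not shown here.

```python
class BatchTracker:
    """Tracks which positions of each batch have finished (in-memory sketch)."""

    def __init__(self):
        self._seen = {}  # batch_id -> set of completed positions

    def record(self, batch_id, position, total):
        """Record one finished message.

        Returns True only for the call that completes the batch, so the
        follow-up Lambda would be triggered exactly once per batch.
        """
        seen = self._seen.setdefault(batch_id, set())
        if position in seen:
            return False  # duplicate delivery, already counted
        seen.add(position)
        return len(seen) == total
```

Tracking positions in a set rather than a plain counter means a redelivered message does not push the count past the total or complete the batch early.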

Kinesis-like sharding with SQS?

I'm interested in replacing Kinesis (because it's expensive and I don't need the historic log) with SQS, but I need a sharding/partitioning mechanism, specifically when processing with Lambda.
I see SQS FIFO queues have recently acquired Lambda event mapping -
https://aws.amazon.com/blogs/compute/new-for-aws-lambda-sqs-fifo-as-an-event-source/
which I think brings partitioning tantalisingly close via the use of MessageGroupID.
Message processing is described as proceeding via the following rules -
1) Return the oldest message where no other message with the same MessageGroupId is in flight.
2) Return as many messages with the same MessageGroupId as possible.
3) If a message batch is still not full, go back to the first rule. As a result, it’s possible for a single batch to contain messages from multiple MessageGroupIds.
1) and 2) sound great - each Lambda request batch containing a single MessageGroupID only - but then 3) seems to mess it all up :-(
Any thoughts on a strategy to ensure every Lambda request batch only contains messages from a single MessageGroupID ? Possibly via MessageDeduplicationID ?
I suspect the answer here is just to use one queue per “partition”, as SQS pricing works on a per-message basis, not per-queue. If you have a lot of “partitions”, create the queues programmatically during the stack spin-up process (e.g. as part of a CodeBuild script) rather than defining each and every queue in CloudFormation.
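For the producer side of a queue-per-partition setup, one possible way to choose the queue is to hash the partition key, much as a stream maps keys to shards. The function name, the queue-name prefix, and the fixed queue count are all assumptions made for this sketch.

```python
import hashlib


def queue_name_for(partition_key, queue_count, prefix="events-partition-"):
    """Map a partition key to one of queue_count queue names.

    Uses SHA-256 rather than Python's built-in hash(), which is salted
    per process and would send the same key to different queues on
    different hosts.
    """
    if queue_count < 1:
        raise ValueError("queue_count must be at least 1")
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % queue_count
    return f"{prefix}{index}"
```

Because every message for a given key lands on the same queue, a Lambda attached to that queue only ever receives batches from its own partition.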

What is the difference between Kinesis and SQS?

I know there is a lot of material online for this question; however, I have not found any that explains it clearly to a rookie like me... I'd appreciate it if someone could help me understand the key differences between these two services, and their use cases, with real-life examples. Thank you!
Amazon SQS is a queue. The basic process is:
1) Messages are sent to the queue. They stay there for up to 14 days.
2) Worker programs can request a message (or up to 10 messages) from the queue.
3) When a message is retrieved from the queue:
   It stays in the queue but is marked as invisible.
   When the worker has finished processing the message, it tells SQS to delete the message from the queue.
   If the worker does not delete the message within the queue's visibility timeout period, the message reappears on the queue for another worker to process.
   The worker can, if desired, periodically tell SQS to keep a message invisible because it is still being processed.
Thus, once a message is processed, it is deleted.
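The receive, process, delete cycle described above could look roughly like this from a worker's point of view. The function name is made up; `sqs` is expected to behave like `boto3.client("sqs")`, and `handle` is whatever callable does the actual work.

```python
def process_one_poll(sqs, queue_url, handle):
    """Receive up to 10 messages and delete the ones handled successfully.

    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 per request
        WaitTimeSeconds=20,      # long polling
    )
    deleted = 0
    # The "Messages" key is absent when nothing was received.
    for message in response.get("Messages", []):
        try:
            handle(message["Body"])
        except Exception:
            # Not deleted: the message becomes visible again after the
            # timeout and another worker can retry it.
            continue
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        deleted += 1
    return deleted
```

Deleting only after `handle` returns is what gives at-least-once behaviour: a crash mid-processing leads to a retry rather than a lost message.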
In Amazon Kinesis, a message is sent to a stream. The stream is divided into shards (think of them as mini-streams). When a message is received, Kinesis stores the message in sequential order. Then, workers can request messages from the start of the stream, or from a specific spot in the stream. For example, a worker that has already processed 5 messages can ask for the 6th message. The messages are retained in the stream for a period of time (e.g. 24 hours).
I like to think of it like a film strip — each frame in a film is kept in order. You can play a film from the start, or you can fast-forward to the middle and start playing from there. In addition, you can rewind to an earlier part and watch it. The same is true for a Kinesis stream, and multiple consumers can read from various parts of the stream simultaneously.
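A toy in-memory model may help show the difference in reading behaviour. Neither class is a real AWS API; both are simplifications that ignore shards, retention, and visibility timeouts.

```python
from collections import deque


class ToyQueue:
    """Queue-like: each item is handed out once, then it is gone."""

    def __init__(self, items):
        self._items = deque(items)

    def take(self):
        return self._items.popleft() if self._items else None


class ToyStream:
    """Stream-like: records stay in order and each reader picks a position."""

    def __init__(self, records):
        self._records = list(records)

    def read(self, position, limit=10):
        # Reading does not remove anything, so readers can rewind.
        return self._records[position:position + limit]
```

Two consumers calling `take()` on the same ToyQueue split the items between them, while two consumers calling `read(0)` on the same ToyStream both see every record.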
So, which to choose?
If a message is used once and then discarded, a queue is probably the better choice.
If retaining message order is important and/or messages will be used more than once, then a stream is probably better.
This article sums it up pretty nicely, imo:
https://sookocheff.com/post/aws/comparing-kinesis-and-sqs/
But basically, if you don't know which one you need, start with SQS until it can't do what you want. SQS is dead simple to set up and use, and requires almost no expertise to use well.
Kinesis takes a lot more time and expertise to set up and use, so unless you need it, don't bother, even though it could be used for many of the same things as SQS.
One big difference: with SQS, if you have multiple consumers reading from the queue, then each consumer will only ever see the messages it consumes, because other consumers are blocked from seeing them. With Kinesis, many consumers can access the stream at the same time, and each consumer sees the entire stream. So SQS is good for taking a large number of tasks and doling out pieces to lots of consumers to work on in parallel (among other things), whereas with Kinesis multiple consumers can each read the entire stream and do something with ALL of the data in it.
The linked article explains it better than me.
I'll try to give a simple answer based on my practical experience:
Consider SQS as a temporary storage service. Use cases:
manage data with different queue priorities
store data for a limited period of time
Lambda DLQ
reduce costs with long polling
create a FIFO
Consider Kinesis as a collector of large streams of real-time data. Use cases:
very very large stream of data from different sources
backup of data just enabling Firehose (you get a data lake for free)
get statistics at once during the collecting phase integrating Kinesis Analytics
have checkpoints to keep track in DynamoDB of records processed/failed
Note: both services can be integrated with Lambda functions very easily, so there are plenty of use cases that can be solved with either SQS or Kinesis. I have tried to list some use cases where I found that one of the two performed notably better than the other. Hope it is helpful :)

Single SQS Queue vs Multiple SQS Queues while creating an Async Model

I have to develop a component whose APIs are async in nature. To build this async model, I am going to use AWS SQS queues for publishing messages; the client will read from the queue and send the response back into the queue. There are currently 10 APIs that I have to expose.
Currently, I can think of having a single request and a single response queue (which I will poll) for all the APIs and the payload of the APIs can be defined by some Operation.
The other way is to use a separate queue for each API. The advantage that I can see for multiple queues is that each API can have different traffic and having multiple queues can help the client of the queues to scale effectively.
What can be other pros or cons for both the approaches?
Separate your use-case into 2 distinct problems:
Problem 1: APIs to Workers, one queue or multiple?
If your workers do different types of work, then having a single queue will require them to inspect and then discard messages they don't care about. If this is the case, then you should have one queue per message type. This way, a worker should be able to handle any message it receives from its queue.
If workers start ignoring messages, then other workers, which may be idle, may wait a while for the messages they care about.
Problem 2: Using a return queue for the "results". If your clients will be polling for results, then at each poll, your API will need to poll the queue. Again, it will be "searching" for the right response, discarding those it doesn't care about, starving other clients.
Recommendation:
Use multiple queues, one per "worker type". Each worker should be able to process any message it receives from its queue.
Then use something other than SQS to store the result. One option is to use S3 to store the result:
When your API "creates" the task, create an object in S3 and put a reference to that S3 object on your SQS queue.
Your worker will do the work, then put the result where it was told to.
When your client polls your API for the result, your API will check S3 and return the status/results.
Instead of S3, other data stores could be used if appropriate: RDS, DynamoDB, etc.
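The three S3 steps above might be sketched as follows. The function names, the `results/<task_id>.json` key layout, and the pending/done status fields are assumptions for illustration; `s3` and `sqs` are expected to behave like `boto3.client("s3")` and `boto3.client("sqs")`.

```python
import json
import uuid


def submit_task(s3, sqs, bucket, queue_url, payload):
    """API side: create a placeholder result object and enqueue the task."""
    task_id = str(uuid.uuid4())
    result_key = f"results/{task_id}.json"
    s3.put_object(
        Bucket=bucket,
        Key=result_key,
        Body=json.dumps({"status": "pending"}).encode("utf-8"),
    )
    # The message carries a reference to where the result should go.
    message = {"bucket": bucket, "result_key": result_key, "payload": payload}
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(message))
    return task_id


def complete_task(s3, message_body, compute):
    """Worker side: do the work and write the result where it was told to."""
    task = json.loads(message_body)
    result = compute(task["payload"])
    s3.put_object(
        Bucket=task["bucket"],
        Key=task["result_key"],
        Body=json.dumps({"status": "done", "result": result}).encode("utf-8"),
    )


def get_status(s3, bucket, task_id):
    """API side: read the current status/result when the client polls."""
    obj = s3.get_object(Bucket=bucket, Key=f"results/{task_id}.json")
    return json.loads(obj["Body"].read())
```

Because each client looks up its own object by task ID, no poller has to sift through other clients' responses, which is the starvation problem the return queue had.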