Is SQS better than DynamoDB for peak loads? - amazon-web-services

A service runs on ECS and writes the requested URL to a DynamoDB table. Dynamic scaling was activated to keep DynamoDB costs from getting too high. At peak times, DynamoDB scales more slowly than requests come in, so some calls are not logged. My question now is whether writing to an SQS queue would be the better way here, because the documentation says:
Standard queues support a nearly unlimited number of API calls per second, per API action (SendMessage, ReceiveMessage, or DeleteMessage).
Of course, the messages would then have to be written back to DynamoDB, but another service can then do that.
Is the throughput of messages per second to SQS really unlimited, so it's definitely cheaper to send messages to SQS instead of increasing DynamoDB's writes per second?

I don't know if this qualifies as a good answer, but recalling a discussion with my architect at the time: we concluded that having a queue for precisely this problem is good practice, regardless of load. It also retains requests even if downstream services go down, which is an added benefit.

SQS and DynamoDB fit two very different use cases. It's not so much which is better; it's which is right for what you need.
DynamoDB is a NoSQL, document-based database. It is best when you have known access patterns to data that needs to persist over time and be retrieved quickly, but that you are probably not changing often (or at least the changes do not have to be visible immediately, within ~5 ms). Each document in DynamoDB is similar (but also very different) to a row in a standard SQL table, in that it has attributes (columns) and keys (partition and sort key) and is retrievable through a query (though dynamic, on-the-fly queries are NOT a good fit for DynamoDB).
SQS is a queue system. It offers no long-term persistence. Payloads of JSON objects are dropped into the queue and then processed by some endpoint: a Lambda, a write into DynamoDB, or something else entirely, depending on your product's use case. It is perfect for when you receive bursts of data but your system needs some time to handle each individual payload, for example because it is waiting on other systems to finish before it can handle the next one. Instead of scaling horizontally (handling all the payloads in parallel), you can absorb the burst and handle payloads through a single thread or only a few threads. You cannot access the data while it is waiting in the queue, and you cannot query it; you can only wait until each message pops off the queue and into processing by whatever system you have set up to receive it.
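The buffering pattern described above can be sketched locally. Here Python's stdlib `queue.Queue` stands in for SQS (the real service would use boto3's `send_message`/`receive_message`), and the consumer's "database write" is simulated with a list; all names are illustrative assumptions.

```python
import json
import queue

# Local stand-in for an SQS queue; in AWS this would be
# sqs.send_message / sqs.receive_message via boto3.
buffer_queue = queue.Queue()

def producer(url):
    """Burst traffic drops payloads into the queue instead of
    writing straight to the database."""
    buffer_queue.put(json.dumps({"requested_url": url}))

def consumer(batch_size=10):
    """A separate worker drains the queue at its own pace and
    writes to DynamoDB (simulated here by collecting items)."""
    written = []
    while not buffer_queue.empty() and len(written) < batch_size:
        payload = json.loads(buffer_queue.get())
        written.append(payload)  # stand-in for table.put_item(Item=payload)
    return written

# A burst of 25 requests arrives faster than the DB can scale...
for i in range(25):
    producer(f"https://example.com/{i}")

# ...but nothing is lost: the consumer drains one batch now,
# and the rest stay safely queued.
first_batch = consumer()
print(len(first_batch))
```

The point is that the queue absorbs the spike, so write capacity only has to match the consumer's pace, not the peak arrival rate.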
The answer to your question is entirely dependent on your use case and your system - something we here at SO will never really understand or know simply because we will always be hearing about it through you and never really experiencing it. As such, to answer it, you need to understand the capabilities of both Dynamo and SQS, the pros and cons for each, and then determine which is best for your product.

Related

Is an SQS needed with a Lambda in this use case?

I'm trying to build a flow that allows a user to enter data that is then stored in RDS. My question is: do I need to go from USER -> SQS -> Lambda -> RDS, or is it better to go straight from USER -> Lambda -> RDS, skipping the queue entirely? Are there going to be scalability issues with the latter?
I do like that SQS can retry a large number of times to guarantee delivery of the data, but is there a similar way to retry with a Lambda alone? It's important that all of the data is stored, and stored in a timely manner. I'm struggling to see the trade-offs between the two scenarios.
If anyone has any input on the situation, that would be amazing.
Are there going to be scalability issues with the latter?
It depends on multiple metrics, including traffic, spikes, size of the database, requests per minute, etc.
Putting SQS before the Lambda lets you manage the number of database queries per unit of time according to your needs. It is a "queue", and you are consuming that queue. In some business cases it may not be useful (banking transactions, etc.), but in others (analytics calculations) it can be helpful. Instead of making a single insert whenever the Lambda is invoked, you can set a batch size and insert in batches (e.g., 10 records at once), which reduces the number of queries.
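The batching idea can be sketched as a Lambda-style handler. The event shape mirrors what SQS delivers to Lambda (a `Records` list with a `body` per message); `write_batch` is a hypothetical stand-in for one batched RDS insert, not a real API.

```python
import json

def write_batch(rows, sink):
    """Hypothetical stand-in for a single batched INSERT into RDS."""
    sink.append(list(rows))  # one DB round trip for the whole batch

def handler(event, context=None, sink=None):
    """Lambda-style handler: SQS hands us up to `batch size` records
    in one invocation, so we issue one insert instead of N."""
    rows = [json.loads(record["body"]) for record in event["Records"]]
    if rows:
        write_batch(rows, sink)
    return len(rows)

db = []  # simulated database: each entry is one batched write
event = {"Records": [{"body": json.dumps({"id": i})} for i in range(10)]}
handled = handler(event, sink=db)
print(handled)
```

With a batch size of 10, ten messages cost one database round trip instead of ten separate inserts.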
You can also define a dead-letter queue to catch your problematic data (messages that couldn't make it to the database). It is another queue that you can check later to identify problematic inputs. The documentation can be found here

Lambda Kinesis Stream Consumer Concurrency Question

I have a Lambda with an Event source pointed to a Kinesis Stream Consumer. The stream has 30 shards.
I can see requests coming in on the lambda console, and I can see metrics in the Enhanced Fan Out section of the kinesis console so it appears everything is configured right.
However, my concurrent executions of the Lambda are capped at 10 for some reason, and I can't figure out why. Most of the documentation suggests that when Enhanced Fan-Out is used and a Lambda is listening to a stream consumer, one Lambda per shard should be running.
Can anyone explain how concurrency with lambda stream consumers work?
I have a couple of pointers just in case. The first thing is to make sure your lambda concurrency limit is actually over 10. It should be, as it defaults to 1000, but it doesn't hurt to check.
About the explanation of how lambda stream consumers work, you have the details at the lambda docs.
One thing I've seen often with Kinesis Data Streams, is having trouble with the Partition Key of the records. As you probably know, Kinesis Data Streams will send all the records with the same partition key to the same shard, so they can be processed in the right order. If records were sent to any shard (for example using a simple round-robin) then you couldn't have any guarantee they would be processed in order, as different shards are read by different processors.
It's important then to make sure you are distributing your keys as evenly as possible. If most records have the same partition key, then one of the shards will be very busy while the others are not getting traffic. It might be the case you are using only 10 different values for your partition keys, in which case you would be sending data to only 10 shards, and since a lambda function execution will be connected to only one shard, you have only 10 concurrent executions.
You can find out which shard you are using by checking the output of PutRecord. You can also force a shard ID by overriding the hashing mechanism. There is more information about partition key processing and record ordering in the SDK docs.
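The key-to-shard mapping can be sketched like this. It mirrors how Kinesis routes records (MD5 of the partition key into a 128-bit hash space, split across shard ranges); the shard count and key names are assumptions for illustration, and real shard assignment comes from the `ShardId` in the PutRecord response.

```python
import hashlib

NUM_SHARDS = 30
HASH_SPACE = 2 ** 128  # Kinesis hashes partition keys into a 128-bit space

def shard_for(partition_key):
    """Mimic Kinesis routing: MD5 the partition key and map the
    resulting 128-bit integer onto evenly split shard ranges."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h * NUM_SHARDS // HASH_SPACE  # always in 0..NUM_SHARDS-1

# Only 10 distinct partition keys -> at most 10 distinct shards get
# traffic, no matter how many records are sent, so at most 10
# concurrent Lambda executions can result.
shards_hit = {shard_for(f"key-{i % 10}") for i in range(1000)}
print(len(shards_hit))
```

The same key always lands on the same shard (that is what preserves per-key ordering), which is exactly why a small set of key values caps your effective parallelism.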
Also make sure you read the troubleshooting guide, as sometimes you can get records processed by two processors concurrently and you might want to be prepared for that.
I don't know if your issue will be related to these pointers, but the Key Partitioning is a recurrent issue, so I thought I would comment on it. Good luck!

AWS Lambda - Store state of a queue

I'm currently tasked with building a serverless architecture for communication between government agencies and citizens, and a main component is some form of queue containing an object/pointer to each citizen's request, sorted by priority. Government workers can then process an element when available. As Lambda is stateless, I need to save the queue somewhere outside it.
For saving state I've gathered that you can use DynamoDB or S3 Buckets and use event triggers to invoke related Lambda methods. Some also suggest using Parameter Store to save some state variables. Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Finally, I've also read a bit about SQS, though I have no idea if it is at all applicable to this case.
What is the best-practice / suggested approach when working with Lambda in this way? I'm leaning towards S3 Buckets, due to event triggering, and not using DynamoDB as our DB.
Storing things globally has also come up, though as you can't guarantee that the Lambda doesn't terminate, it doesn't seem like a good idea.
Correct -- this is not viable at all. Note that what you are actually referring to when you say "the Lambda" is the process inside the container... and any time your Lambda function is handling more than one invocation concurrently, you are guaranteed that they will not be running in the same container -- so "global" variables are only useful for optimization, not state. Any two concurrent invocations of the same function have two entirely different global environments.
Forgetting all about Lambda for a moment -- I am not saying don't use Lambda; I'm saying that whether or not you use Lambda isn't relevant to the rest of what is written, below -- I would suggest that parallel/concurrent actions in general are perhaps one of the most important factors that many developers tend to overlook when trying to design something like you are describing.
How you will assign work from this work "queue" is extremely important to consider. You can't just "find the next item" and display it to a worker.
You must have a way to do all of these things:
finding the next item that appears to be available
verify that it is indeed available
assign it to a specific worker
mark it as unavailable for assignment
Not only that, but you have to be able to do all of these things atomically -- as a single logical action -- and without collisions.
A naïve implementation runs the risk of assigning the same work item to two or more people, with the first assignment being blindly and silently overwritten by subsequent assignments that happen at almost the same time.
DynamoDB allows conditional updates -- update a record if and only if a certain condition is true. This is a critical piece of functionality that your solution needs to accommodate -- for example, assign work item x to user y if and only if item x is currently unassigned. A conditional update will fail, and changes nothing, if the condition is not true at the instant the update happens and therein lies the power of the feature.
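A minimal sketch of that conditional assignment, written as boto3-style `update_item` parameters (the table key and attribute names are assumptions). The `ConditionExpression` is what makes the check-and-assign a single atomic action on the DynamoDB side.

```python
def build_assignment_request(item_id, worker_id):
    """Parameters you could pass to table.update_item(**params):
    assign the work item to `worker_id` if and only if nobody
    currently holds it."""
    return {
        "Key": {"item_id": item_id},
        "UpdateExpression": "SET assigned_to = :worker",
        # The update fails atomically (ConditionalCheckFailedException)
        # if the attribute already exists, so two workers can never
        # both win the same item.
        "ConditionExpression": "attribute_not_exists(assigned_to)",
        "ExpressionAttributeValues": {":worker": worker_id},
    }

params = build_assignment_request("case-42", "worker-7")
print(params["ConditionExpression"])
```

If the condition fails, the worker simply moves on to the next candidate item; no lock or read-modify-write cycle is needed.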
S3 does not support conditional updates, because unlike DynamoDB, S3 operates only on an eventual-consistency model in most cases. After an object in S3 is updated or deleted, there is no guarantee that the next request to S3 will return the most recent version or that S3 will not return an item that has recently been deleted. This is not a defect in S3 -- it's an optimization -- but it makes S3 unsuited to the "work queue" aspect.
Skip this consideration and you will have a system that appears to work, and works correctly much of the time... but at other times, it "mysteriously" behaves wrongly.
Of course, if your work items have accompanying documents (scanned images, PDF, etc.), it's quite correct to store them in S3... but S3 is the wrong tool for storing "state." SSM Parameter Store is the wrong tool, for the same reason -- there is no way for two actions to work cooperatively when they both need to modify the "state" at the same time.
"Event triggers" are useful, of course, but from your description, the most notable "event" is not from the data, or the creation of the work item, but rather it is when the worker says "I'm ready for my next work item." It is at that point -- triggered by the web site/application code -- when the steps above are executed to select an item and assign it to a worker. (In practice, this could be browser → API Gateway → Lambda). From your description, there may be no need for the creation of a new work item to trigger an "event," or if there is, it is not the most significant among the events.
You will need a proper database for this. DynamoDB is a candidate, as is RDS.
The queues provided by SQS are designed to decouple two parts of your application: when two processes run at different speeds, SQS is used as a buffer, allowing the producer to safely store the work needing to be done and then continue with something else until the consumer is able to do the work. SQS queues are opaque -- you can't introspect what's in the queue; you just take the next message and are responsible for handling it. On its face, that seems to partially describe what you need, but it is not a clean match for this use case. Queues are limited in how long messages can be retained, and once a message is successfully processed, it is completely gone.
Note also that SQS is only a match to your use case with the FIFO queue feature enabled, which guarantees perfect in-order delivery and exactly-once delivery -- standard SQS queues, for performance optimization reasons, do not guarantee perfect in-order delivery and may under certain conditions deliver the same message more than once, to the same consumer or a different consumer. But the SQS FIFO queue feature does not coexist with event triggers, which require standard queues.
So SQS may have a role, but you need an authoritative database to store the work and the results of the business process.
If you need to store the message, then SQS is not the best tool here, because your Lambda function would then need to process the message and finally store it somewhere, making SQS nothing but a broker.
The S3 approach gives what you need out of the box, considering you can store the files (messages) in an S3 bucket and then have one Lambda consume its event. Your Lambda would then process this event and the file would remain safe and sound on S3.
If you eventually need multiple consumers for this message, then you can send the S3 event to SNS instead and finally you could subscribe N Lambda Functions to a given SNS topic.
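A minimal Lambda handler for the S3 approach might look like the following. The event shape follows S3's notification format; `process_file` and the bucket/key values are hypothetical stand-ins.

```python
def process_file(bucket, key, processed):
    """Hypothetical business logic; the original object stays
    safe and sound in S3 regardless of what happens here."""
    processed.append((bucket, key))

def handler(event, context=None, processed=None):
    """Consume an S3 event notification: each record names the
    bucket and object key that triggered the invocation."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        process_file(bucket, key, processed)
    return len(event["Records"])

seen = []
event = {"Records": [
    {"s3": {"bucket": {"name": "citizen-requests"},
            "object": {"key": "2024/request-001.json"}}},
]}
count = handler(event, processed=seen)
print(count)
```

Pointing the same S3 event at SNS instead of directly at the Lambda is what allows several such handlers to consume one upload.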
You appear to be worrying too much about the infrastructure at this stage and not enough on the application design. The fact that it will be serverless does not change the basic functionality of the application — it will still present a UI to users, they will still choose options that must trigger some business logic and information will still be stored in a database.
The queue you describe is merely a datastore of messages that are in a particular state. The application will have some form of business logic for determining the next message to handle, which could be based on creation timestamp, priority, location, category, user (eg VIP users who get faster response), specialization of the staff member asking for the next message, etc. This is not a "queue" but rather a calculation to be performed against all 'unresolved' messages to determine the next message to assign.
If you wish to go serverless, then the back-end will certainly be using Lambda and a database (eg DynamoDB or Amazon RDS). The application should store everything in the database so that data is available for the application's business logic. There is no need to use SQS since there really isn't a "queue", and Parameter Store is merely a way of sharing parameters amongst application components — it is not meant for core data storage.
Determine the application functionality first, then determine the appropriate architecture to make it happen.

AWS Event-Sourcing implementation

I'm quite a newbie to microservices and Event Sourcing, and I was trying to figure out a way to deploy a whole system on AWS.
As far as I know there are two ways to implement an Event-Driven architecture:
Using AWS Kinesis Data Stream
Using AWS SNS + SQS
So my base strategy is that every command is converted to an event which is stored in DynamoDB and exploit DynamoDB Streams to notify other microservices about a new event. But how? Which of the previous two solutions should I use?
The first one has the advantages of:
Message ordering
At-least-once delivery
But the disadvantages are quite problematic:
No built-in autoscaling (you can achieve it using triggers)
No message visibility functionality (apparently, asking to confirm that)
No topic subscription
Very strict read throughput: you can improve it using multiple shards, but from what I read here, you then need an ill-defined number of Lambdas with different invocation priorities and an ill-defined strategy to avoid duplicate processing across multiple instances of the same microservice.
The second one has the advantages of:
Is completely managed
Very high TPS
Topic subscriptions
Message visibility functionality
Drawbacks:
SQS messages have best-effort ordering; I still have no idea what that means.
It says "A standard queue makes a best effort to preserve the order of messages, but more than one copy of a message might be delivered out of order".
Does it mean that, given n copies of a message, the first copy is delivered in order while the others are delivered out of order relative to the other messages' copies? Or could "more than one" mean "all"?
A very big thanks for every kind of advice!
I'm quite a newbie to microservices and Event Sourcing
Review Greg Young's talk Polyglot Data for more insight into what follows.
Sharing events across service boundaries has two basic approaches - a push model and a pull model. For subscribers that care about the ordering of events, a pull model is "simpler" to maintain.
The basic idea being that each subscriber tracks its own high water mark for how many events in a stream it has processed, and queries an ordered representation of the event list to get updates.
In AWS, you would normally get this representation by querying the authoritative service for the updated event list (the implementation of which could include paging). The service might provide the list of events by querying dynamodb directly, or by getting the most recent key from DynamoDB, and then looking up cached representations of the events in S3.
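The high-water-mark pull described above can be sketched in a few lines; the in-memory list stands in for the authoritative service's ordered event API, and the class/method names are illustrative assumptions.

```python
class Subscriber:
    """Tracks its own high-water mark and pulls only events it
    has not yet processed, in order."""

    def __init__(self):
        self.high_water_mark = 0
        self.seen = []

    def poll(self, event_store):
        # In AWS this would be a (possibly paged) query against the
        # authoritative service; here we slice an ordered local list.
        new_events = event_store[self.high_water_mark:]
        self.seen.extend(new_events)
        self.high_water_mark += len(new_events)
        return new_events

store = ["OrderPlaced", "OrderPaid"]
sub = Subscriber()
sub.poll(store)              # picks up both existing events
store.append("OrderShipped")
print(sub.poll(store))       # only the new event is pulled
```

A push notification (SNS) then merely tells the subscriber "something new exists", prompting an earlier `poll`; ordering is still guaranteed by the pull, not by the notification channel.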
In this approach, the "events" that are being pushed out of the system are really just notifications, allowing the subscribers to reduce the latency between the write into Dynamo and their own read.
I would normally reach for SNS (fan-out) for broadcasting notifications. Consumers that need bookkeeping support for which notifications they have handled would use SQS. But the primary channel for communicating the ordered events is pull.
I myself haven't looked hard at Kinesis - there's some general discussion in earlier questions -- but I think Kevin Sookocheff is onto something when he writes
...if you dig a little deeper you will find that Kinesis is well suited for a very particular use case, and if your application doesn’t fit this use case, Kinesis may be a lot more trouble than it’s worth.
Kinesis’ primary use case is collecting, storing and processing real-time continuous data streams. Data streams are data that are generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes).
Another thing: the fact that I'm accessing data from another microservice's stream is an anti-pattern, isn't it?
Well, part of the point of dividing a system into microservices is to reduce the coupling between the capabilities of the system. Accessing data across the microservice boundaries increases the coupling. So there's some tension there.
But basically, if I'm using a pull model, I need to read data from other microservices' streams. Is that avoidable?
If you query the service you need for the information, rather than digging it out of the stream yourself, you reduce the coupling -- much like asking a service for data rather than reaching into an RDBMS and querying the tables yourself.
If you can avoid sharing the information between services at all, then you get even less coupling.
(Naive example: order fulfillment needs to know when an order has been paid for; so it needs a correlation id when the payment is made, but it doesn't need any of the other billing details.)

AWS Lambda - how to identify duplicate messages

Since several of the triggers for AWS Lambda can only guarantee message delivery "at least once" (SQS and IoT with QoS=1), I wonder what's the best way to identify a duplicate message and ignore it.
I can see that I currently get several duplicate messages, triggering my lambdas twice, causing noise and invalid data as a consequence.
In my client, I solve it by just storing a list of message IDs that I've processed, but in the Lambdas, I have nowhere to store a state.
Of course I could maintain a DB table of processed message IDs but it seems like overkill to me (and probably adds extra billed runtime to the lambdas). A simple key/value store service in memory would be enough.
What other solutions are you guys using?
I know you don't want to use a DB, but DynamoDB can work well for this kind of thing. If you have something you can use as a good partition key, it will still be quite performant. It will still add a very small amount of time to your Lambda run time and, of course, you will be charged for your DynamoDB capacity and data. I use this successfully to discard duplicate messages.
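The dedup idea can be sketched like this. A dict stands in for the DynamoDB table; in real code the check-plus-insert would be a single conditional `put_item` with `ConditionExpression="attribute_not_exists(message_id)"`, so the race between two concurrent Lambdas is resolved by DynamoDB, not by the application.

```python
def process_once(message_id, payload, seen_table, results):
    """Process a message only if its ID has not been recorded.
    The dict stands in for a DynamoDB table; membership check plus
    insert models an atomic conditional put_item."""
    if message_id in seen_table:   # conditional write would fail here
        return False               # duplicate delivery: ignore it
    seen_table[message_id] = True  # record the ID
    results.append(payload)        # safe to do the real work now
    return True

table, handled = {}, []
process_once("msg-1", "hello", table, handled)
process_once("msg-1", "hello", table, handled)  # at-least-once duplicate
process_once("msg-2", "world", table, handled)
print(handled)
```

A TTL attribute on the DynamoDB items keeps the dedup table from growing forever, since message IDs only need to be remembered for as long as duplicates can plausibly arrive.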
The other thing worth looking into would be ElastiCache, which comes in Memcached and Redis flavors. This would be faster -- if performance is a particular focus -- but, unlike DynamoDB, it is not persistent.