I have a use case where there will be stream of data coming and I cannot consume it at the same pace and need a buffer. This can be solved using an SNS-SQS queue. I came to know the Kinesis solves the same purpose, so what is the difference? Why should I prefer (or should not prefer) Kinesis?
Keep in mind this answer was correct for Jun 2015
After studying the issue for a while, having the same question in mind, I found that SQS (with SNS) is preferred for most use cases unless the order of the messages is important to you (SQS doesn't guarantee FIFO on messages).
There are 2 main advantages for Kinesis:
you can read the same message from several applications
you can re-read messages in case you need to.
Both advantages can be achieved by using SNS as a fan out to SQS. That means that the producer of the message sends only one message to SNS, Then the SNS fans-out the message to multiple SQSs, one for each consumer application. In this way you can have as many consumers as you want without thinking about sharding capacity.
Moreover, we added one more SQS that is subscribed to the SNS that will hold messages for 14 days. In normal case no one reads from this SQS but in case of a bug that makes us want to rewind the data we can easily read all the messages from this SQS and re-send them to the SNS. While Kinesis only provides a 7 days retention.
In conclusion, SNS+SQSs is much easier and provides most capabilities. IMO you need a really strong case to choose Kinesis over it.
On the surface they are vaguely similar, but your use case will determine which tool is appropriate. IMO, if you can get by with SQS then you should - if it will do what you want, it will be simpler and cheaper, but here is a better explanation from the AWS FAQ which gives examples of appropriate use-cases for both tools to help you decide:
FAQ's
Semantics of these technologies are different because they were designed to support different scenarios:
SNS/SQS: the items in the stream are not related to each other
Kinesis: the items in the stream are related to each other
Let's understand the difference by example.
Suppose we have a stream of orders, for each order we need to reserve some stock and schedule a delivery. Once this is complete, we can safely remove the item from the stream and start processing the next order. We are fully done with the previous order before we start the next one.
Again, we have the same stream of orders, but now our goal is to group orders by destinations. Once we have, say, 10 orders to the same place, we want to deliver them together (delivery optimization). Now the story is different: when we get a new item from the stream, we cannot finish processing it; rather we "wait" for more items to come in order to meet our goal. Moreover, if the processor process crashes, we must "restore" the state (so no order will be lost).
Once processing of one item cannot be separated from processing another one, we must have Kinesis semantics in order to handle all the cases safely.
Kinesis support multiple consumers capabilities that means same data records can be processed at a same time or different time within 24 hrs at different consumers, similar behavior in SQS can be achieved by writing into multiple queues and consumers can read from multiple queues. However writing again into multiple queue will add sub seconds {few milliseconds} latency in system.
Second, Kinesis provides routing capability to selective route data records to different shards using partition key which can be processed by particular EC2 instances and can enable micro batch calculation {Counting & aggregation}.
Working on any AWS software is easy but with SQS is easiest one. With Kinesis, there is a need to provision enough shards ahead of time, dynamically increasing number of shards to manage spike load and decrease to save cost also required to manage. it's pain in Kinesis, No such things are required with SQS. SQS is infinitely scalable.
Excerpt from AWS Documentation:
We recommend Amazon Kinesis Streams for use cases with requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.
The biggest advantage for me is the fact that Kinesis is a replayable queue, and SQS is not. So you can have multiple consumers of the same messages of Kinesis (or the same consumer at different times) where with SQS, once a message has been ack'd, it's gone from that queue.
SQS is better for worker queues because of that.
Another thing: Kinesis can trigger a Lambda, while SQS cannot. So with SQS you either have to provide an EC2 instance to process SQS messages (and deal with it if it fails), or you have to have a scheduled Lambda (which doesn't scale up or down - you get just one per minute).
Edit: This answer is no longer correct. SQS can directly trigger Lambda as of June 2018
https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
The pricing models are different, so depending on your use case one or the other may be cheaper. Using the simplest case (not including SNS):
SQS charges per message (each 64 KB counts as one request).
Kinesis charges per shard per hour (1 shard can handle up to 1000 messages or 1 MB/second) and also for the amount of data you put in (every 25 KB).
Plugging in the current prices and not taking into account the free tier, if you send 1 GB of messages per day at the maximum message size, Kinesis will cost much more than SQS ($10.82/month for Kinesis vs. $0.20/month for SQS). But if you send 1 TB per day, Kinesis is somewhat cheaper ($158/month vs. $201/month for SQS).
Details: SQS charges $0.40 per million requests (64 KB each), so $0.00655 per GB. At 1 GB per day, this is just under $0.20 per month; at 1 TB per day, it comes to a little over $201 per month.
Kinesis charges $0.014 per million requests (25 KB each), so $0.00059 per GB. At 1 GB per day, this is less than $0.02 per month; at 1 TB per day, it is about $18 per month. However, Kinesis also charges $0.015 per shard-hour. You need at least 1 shard per 1 MB per second. At 1 GB per day, 1 shard will be plenty, so that will add another $0.36 per day, for a total cost of $10.82 per month. At 1 TB per day, you will need at least 13 shards, which adds another $4.68 per day, for a total cost of $158 per month.
Kinesis solves the problem of map part in a typical map-reduce scenario for streaming data. While SQS doesnt make sure of that. If you have streaming data that needs to be aggregated on a key, kinesis makes sure that all the data for that key goes to a specific shard and the shard can be consumed on a single host making the aggregation on key easier compared to SQS
Kinesis Use Cases
Log and Event Data Collection
Real-time Analytics
Mobile Data Capture
“Internet of Things” Data Feed
SQS Use Cases
Application integration
Decoupling microservices
Allocate tasks to multiple worker nodes
Decouple live user requests from intensive background work
Batch messages for future processing
I'll add one more thing nobody else has mentioned -- SQS is several orders of magnitude more expensive.
In very simple terms, and keeping costs out of the picture, the real intention of SNS-SQS are to make services loosely coupled. And this is only primary reason to use SQS where the order of the msgs are not so important and where you have more control of the messages. If you want a pattern of job queue using an SQS is again much better. Kinesis shouldn't be used be used in such cases because it is difficult to remove messages from kinesis because kinesis replays the whole batch on error. You can also use SQS as a dead letter queue for more control. With kinesis all these are possible but unheard of unless you are really critical of SQS.
If you want a nice partitioning then SQS won't be useful.
Related
We have configured DynamoDB streams to trigger a Lambda function. More than 10 million unique records will be inserted into DynamoDB table within 30 minutes and Lambda will process these records when triggered through streams.
As per DynamoDB Streams documentation, streams will expire after 24 hrs.
Question:
Does this mean that Lambda function (multiple concurrent executions) should complete processing of all 10 million records within 24hrs?
If some streams events remain to be processed after 24hrs, will they be lost?
As long as you don't throttle the lambda, it won't 'not keep up'.
What will happen is the stream will be batched depending on your settings - so if you have your settings in your dynamo stream to 5 events at once, it will bundle five events and push them toward lambda.
even if that happens hundreds of times a minute, Lambda will (assuming again you aren't purposely limiting lambda executions) spin up additional concurrent executions to handle the load.
This is standard AWS philosophy. Pretty much every serverless resource (and even some not, like EC2 with Elastics Beanstalk) are designed to seamlessly and effortless scale horizontally to handle burst traffic.
Likely your Lambda executions will be done within a couple of minutes of the last event being sent. The '24 hour time out' is against waiting for a lambda to be finished/reactivated (ie: you can set up cloudwatch events to 'hold' Dynamo Streams until certain times of the day then process everything, such as waiting until off hours to let all the streams process, then turning it off again during business hours the next day)
To give you an example that is similar - I ran 10,000 executions through an SQS into a lambda. It completed the 10,000 executions in about 15 mins. Lambda concurrency is designed to handle this kind of burst flow.
Your Dynamo Read/Write capacity is going to be hammered however, so make sure you have it set to at least dynamic and not provisioned.
UPDATE
As #Maurice pointed out in the comments, there is a Stream Limit on concurrent batches sent at a moment with Dynamo. The calculation indicates that it will fall far short even with a short lambda execution time - longer the lambda, the less likely you are to complete.
Which means, if you don't have to have those all processed as quickly as posisble you should divide up the input.
You can add an AWS SQS queue somewhere in the process. Most likely, because even with the largest batch size and and a super quick process you wont get through all them, before the insert into dynamo.
The SQS has limits on its messages of up to 14 days. This may be enough to do what you want. If you have control of the messages coming in you can insert them into an sqs queue with a wait attached to it in order to process a smaller amount inserts at once - what can be accomplished in a single day, or well slightly less. It would be
lambda to collate your inserts into an SQS queue -> SQS with a wait/smaller batch size -> Lambda to insert smaller batches into dynamo -> Dynamo Stream -> Processing Lambda
The other option is to do something similar but use a State Machine with wait times and maps. State Machines have a 1 year run time limit, so you have plenty of time with that one.
The final option is to, instead of streaming the data straight into lambda, execute lambdas to query smaller sections of the dynamo at once to process them
I was studying about DynamoDb where I am stuck on a question for which I can't find any common solution.
My question is: if I have an application with dynamodb as db with initial write capacity of 100 writes per second and there is heavy load during peak hours suppose 300 writes per sec. In order to reduce load on the db which service should I use?
My take says we should go for auto-scaling but somewhere I studied that we can use sqs for making queue for data and kinesis also if order of data is necessary.
In the old days, before DynamoDB Auto-Scaling, a common use pattern was:
The application attempts to write to DynamoDB
If the request is throttled, the application stores the information in an Amazon SQS queue
A separate process regularly checks the SQS queue and attempts to write the data to DynamoDB. If successful, it removes the message from SQS
This allowed DynamoDB to be provisioned for average workload rather than peak workload. However, it has more parts that need to be managed.
These days, DynamoDB can use adaptive capacity and burst capacity to handle temporary changes. For larger changes, you can implement DynamoDB Auto Scaling, which is probably easier to implement than the SQS method.
The best solution depends on the characteristics of your application. Can it tolerate asynchronous database writes? Can it tolerate any throttling on database writes?
If you can handle some throttling from DynamoDB when there’s a sudden increase in traffic, you should use DynamoDB autoscaling.
If throttling is not okay, but asynchronous writes are okay, then you could use SQS in front of DynamoDB to manage bursts in traffic. In this case, you should still have autoscaling enabled to ensure that your queue workers have enough throughout available to them.
If you must have synchronous writes and you can not tolerate any throttling from DynamoDB, you should use DynamoDB’s on demand mode. (However, do note that there can still be throttling if you exceed 1k WCU or 3k RCU for a single partition key.)
Of course cost is also a consideration. Using DynamoDB with autoscaling will be the most cost effective method. I’m not sure how On Demand compares to the cost of using SQS.
For example, RabbitMQ has a way in setting queue limits. If that limit is reached the new messages from publishers will be rejected, thus applying some kind of backpressure that starts from consumers to the producers. (since messages in queues means not processed by consumers).
Is there a way to assure this kind of behavior for brokers like Kinesis in which the consumers are allowed to pull messages and not the broker pushes to them, like RabbitMQ.
In case of Kinesis, similar to Kafka, the state of the consumers, offset of consumption and so on, is kept in a different entity, DynamoDB for Kinesis and I know this can be trickier to have something like unprocessed records limits out of the box.
Does anyone know if there is some settings you can use, maybe by the use of KCL / KPL client library, or something ?
No. AWS Kinesis does not provide the feature you want unfortunately. There is no way to stop producer writing into a Kinesis stream if the consumer cannot catch up in processing.
In fact this is one of the advantage of using Kinesis, it allows unlimited buffering of data up to the configured retention time for free. The only time it provides back pressure is when the producer writes too much data too fast because of the Amazon Kinesis API limit: https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
If you want a limited size "queue", maybe you want to look into AWS SQS where it has a lower limit of 12000 inflight messages?
If you do want to use Kinesis, you might want to build a custom solution to feed the consumer delay back to the producer. For example, implement custom logic in the producer to monitor the consumer delay ('MillisBehindLatest') using AWS Cloudwatch (See https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kcl.html) and stop when the consumer is falling behind.
I have a server which can only process 20 request at a time. When lots of request coming, I want to store the request data, in some queues. and read a set of request (i.e 20) and process them by batch. What would be the ideal way to that ? Using SQS, or kinesis. I'm totally confused.
SQS = Simple Queue Service is for queuing messages in a 1:1 (once the message is consumed, it is removed from the queue)
Kinesis = low latency, high volumetry data streaming ... typically for 1:N (many consumers of messages)
As Kinesis is also storing the data for a period of time, both are often confused, but their architectural patterns are totally different.
Queue => SQS.
Data Streams => Kinesis.
Taken from https://aws.amazon.com/kinesis/data-streams/faqs/ :
Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?
Amazon Kinesis Data Streams enables real-time processing of streaming
big data. It provides ordering of records, as well as the ability to
read and/or replay records in the same order to multiple Amazon
Kinesis Applications. The Amazon Kinesis Client Library (KCL) delivers
all records for a given partition key to the same record processor,
making it easier to build multiple applications reading from the same
Amazon Kinesis data stream (for example, to perform counting,
aggregation, and filtering).
Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly
scalable hosted queue for storing messages as they travel between
computers. Amazon SQS lets you easily move data between distributed
application components and helps you build applications in which
messages are processed independently (with message-level ack/fail
semantics), such as automated workflows.
Q: When should I use Amazon Kinesis Data Streams, and when should I
use Amazon SQS?
We recommend Amazon Kinesis Data Streams for use cases with
requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are
simpler when all records for a given key are routed to the same record
processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining
the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a
real-time dashboard and another that archives data to Amazon Redshift.
You want both applications to consume data from the same stream
concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that
runs a few hours behind the billing application. Because Amazon
Kinesis Data Streams stores data for up to 7 days, you can run the
audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are
similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track
the successful completion of each item independently. Amazon SQS
tracks the ack/fail, so the application does not have to maintain a
persistent checkpoint/cursor. Amazon SQS will delete acked messages
and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can
configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the
backlog is cleared. With Amazon Kinesis Data Streams, you can scale up
to a sufficient number of shards (note, however, that you'll need to
provision enough shards ahead of time).
Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional
load spikes or the natural growth of your business. Because each
buffered request can be processed independently, Amazon SQS can scale
transparently to handle the load without any provisioning instructions
from you.
We have a business scenario to process large number of messages. key factor is order of messages is important.
If we consider to use AWS SQS FIFO, it can process only 300 messages per second. This is not applicable for us, because we expect more than 300 messages per second and batch processing also not possible, since we can send only one message to SQS. Also FIFO can store 20,000 inflate messages. This is also low for us.
Second consideration is to use AWS DynamoDB and use a sorting attribute(like timestamp) to maintain the order of messages.
Will it be suitable to use DynamoDB for our case as we expect huge number of data? Is there any major difference between SQS and DynamoDB in terms of Performance and transactions?
dynamodb and sqs both scale 'infinitely'. SQS fifo has some limitations as you have pointed out.
if you could get sqs to work for you, it will likely be less expensive, but dynamodb will work as long as you have the ability to pay for the scale you need. Unlimited scaling can be expensive - more so with dynamodb than sqs.