We currently have an application that receives a large amount of sensor data. Each sensor has its own unique sensor id (eg '5834f7718273f92cc326f620') and emits its status at different intervals. The processing order of the messages that come in is not important, for example a newer message of one sensor can be processed before an older message of another sensor. What does matter though, is that each message for a given sensor must be processed sequentially; in the order that that they arrived in the stream.
I have taken a look at the Kinesis Client Library and understand that KCL pushes messages to a single processor per shard. Does this mean that if a stream has only one shard it will have only one processor and couldn't this create a bottleneck? Or does KCL have more than one processor, and somehow, perhaps using the partition key ensures messages with the same partition key are never processed concurrently?
Note: We have taken a look at sqs fifo, but ruled it out as the 300 messages per second limit would soon become an issue.
Yes, each shard can only have one processor at a given moment (per application).
But, you can use the sensor id as the partition key for your kinesis put record request. (see here)
This will make sure that all of this sensor events will get into the same shard and processor.
If you will do that you'll be able to scale your processes and shards and still get each sensor events processed in a single processor
Related
I do not care much about the order of events but I would like the message to be processed exactly once. The lambda listening to SQS messages will store it in DynamoDB so throughput is pretty important as I have multiple microservices (as producers) writing messages to this SQS that will be read by a single microservice.
About processing messages exactly once, that is something that FIFO queue supports but is said to have not a good throughput.
Is the throughput of the FIFO queue the same as the Standard queue if each message has a unique groupId?
If not, my next option is probably to use "attribute_not_exists" in DynamoDB while storing the message.
Which of these should work better?
Messages / sec
FIFO
30,000 messages (with batching + high throughput mode)
3,000 messages (without batching + high throughput mode)
3,000 messages (with batching)
300 messages (without batching)
Standard
Nearly unlimited
https://aws.amazon.com/sqs/faqs/
To process exactly once, you need to use FIFO queue with de-deplication ID.
If your throughput requirement is below the limit mentioned above, then you're fine with the FIFO queue.
If not then, using DynamoDB as your original plan is also an alternative option. But you have to manage a lot of things yourself here with this approach like deleting the message, updating if the message is being read but not yet fully processed, and so on.
FIFO SQS queues have different rate limits than a regular SQS queue regardless of the use of message group ids
SQS Standard queues support a nearly unlimited number of API calls per second, per API action (SendMessage, ReceiveMessage, or DeleteMessage).
FIFO SQS supports 300 TPS for each API method
Look at the quota docs here
Also, AWS has a new feature for higher throughput FIFO SQS queue which might interest you
With batching of maximum 10 messages per API call you can handle 3,000 messages per second with FIFO queue
Regarding making sure you don't handle the same message twice - have you had a look at FIFO de-duplication ID? I am not sure if that's exactly what you need but it sounds pretty similar to your requirement
SQS delivery guarantee is at least once. Your application must be designed to handle processing duplicate messages.
I'd strongly recommend building your application this way.
If you must process some type of data exactly once, you need a strongly consistent system. Consider using dynamodb and conditional updates
I have a DynamoDB Table and it is linked with one Stream, and that stream is linked with one lambda function which processed it.
Question - With above set up if an event comes to the stream and is ingested in Lambda, does that event still resides in that stream or it gets POPPED out as soon as it got ingested in Lambda just like a Queue?
Question 2 Can someone kindly tell me about the inner working of DDB Stream and how it passes the data to Lambda? Like are there any states for the stream events?
P.S: AWS Documentation says that events stay in stream for 24 hour window.
There are two concepts to understand here
Streams
Triggers
Whenever there is a change in the table like an addition, update or deletion, the Kinesis Stream feature of AWS stores that change for a period of 24 hrs. It does this through 4 methods:-
Keys only:- only the keys are stored after the change
New image:- The entire item on which the change is performed is stored
Old image:- When a change is performed on an item, the old item is stored instead of the new one
New and old:- self-explanatory
To associate a lambda function with your streams, a feature called Triggers are used. The changes invoke the Trigger which in-turn performs the lambda function associated with the change.
Part 1 of your question:-
By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream only has one record in it, Lambda only sends one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires. If the Lambda fails it will try and process that message indefinitely (or until it expires), keeping other messages from being processed as a result. To avoid stalled shards(I'll talk about this later), you can configure the event source mapping to retry with smaller batch size, limit the number of retries, or discard records that are too old(you can set the age of the record that lambda can read).
Part 2 of your question:-
The streams which we are talking about are Kinesis Streams It is a feature to be used by multiple producers and consumers. Here the producer is DynamoDb and the consumer is lambda. Consumers have dedicated read throughput so they don't have to compete with other consumers of the same data. With consumers, Kinesis pushes records to Lambda over an HTTP/2 connection, which can also reduce latency between adding a record and function invocation.
The capacity of the streams is determined by the number of shards it contains. Shards are small units of capacity in the Stream. Hence higher the shard value, higher the capacity.
I guess I have explained the working in the part1 of this answer. Feel free to ask follow up questions.
What is shards in kinesis data stream and partition key. I read aws documents but I don't get it. Can someone explain it in simple terms?
From Amazon Kinesis Data Streams Terminology and Concepts - Amazon Kinesis Data Streams:
A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards.
So, a shard has two purposes:
A certain amount of capacity/throughput
An ordered list of messages
If your application must process all messages in order, then you can only use one shard. Think of it as a line at a bank — if there is one line, then everybody gets served in order.
However, if messages only need to be ordered for a certain subset of messages, they can be sent to separate shards. For example, multiple lines in a bank, where each line gets served in order. Or, think of a bus sending GPS coordinates. Each bus sends messages to only a single shard. A shard might contain messages from multiple buses, but each bus only sends to one shard. This way, when the messages from that shard is processed, all messages from a particular bus are processed in order.
This is controlled by using a Partition Key, which identifies the source. The partition key is hashed and assigned to a shard. Thus, all messages with the same partition key will go to the same shard.
At the back-end, there is a typically one worker per shard that is processing the messages, in order, from that shard.
If your system does not care about preserving message order, then use a random partition key. This means the message will be sent to any shard.
How are checkpoints and trimming related in AWS KCL library?
The documentation page Handling Startup, Shutdown, and Throttling says:
By default, the KCL begins reading records from the tip of the
stream;, which is the most recently added record. In this
configuration, if a data-producing application adds records to the
stream before any receiving record processors are running, the records
are not read by the record processors after they start up.
To change the behavior of the record processors so that it always
reads data from the beginning of the stream, set the following value
in the properties file for your Amazon Kinesis Streams application:
initialPositionInStream = TRIM_HORIZON
The documentation page Developing an Amazon Kinesis Client Library Consumer in Java says:
Streams requires the record processor to keep track of the records
that have already been processed in a shard. The KCL takes care of
this tracking for you by passing a checkpointer
(IRecordProcessorCheckpointer) to processRecords. The record processor
calls the checkpoint method on this interface to inform the KCL of how
far it has progressed in processing the records in the shard. In the
event that the worker fails, the KCL uses this information to restart
the processing of the shard at the last known processed record.
The first page seems to say that the KCL resumes at the tip of the stream, the second page at the last known processed record (that was marked as processed by the RecordProcessor using the checkpointer). In my case, I definitely need to restart at the last known processed record. Do I need to set the initialPositionInStream to TRIM_HORIZON?
With kinesis stream you have two options, you can read the newest records, or start from the oldest (TRIM_HORIZON).
But, once you started your application it just reads from the position it stopped using its checkpoints.
You can see those checkpoints in dynamodb (Usually the table name is as the app name).
So if you restart your app it will usually continue from where it stopped.
The answer is no, you don't need to set the initialPositionInStream to TRIM_HORIZON.
When you are reading events from a kinesis stream, you have 4 options:
TRIM_HORIZON - the oldest events that are still in the stream shards before they are automatically trimmed (default 1 day, but can be extended up to 7 days). You will use this option if you want to start a new application that will process all the records that are available in the stream, but it will take a while until it is able to catch up and start processing the events in real-time.
LATEST - the newest events in the stream, and ignore all the past events. You will use this option if you start a new application that you want to process in teal time immediately.
AT/AFTER_SEQUENCE_NUMBER - the sequence number is usually the checkpoint that you are keeping while you are processing the events. These checkpoints are allowing you to reliably process the events, even in cases of reader failure or when you want to update its version and continue processing all the events and not lose any of them. The difference between AT/AFTER is based on the time of your checkpoint, before or after you processed the events successfully.
Please note that this is the only shard specific option, as all the other options are global to the stream. When you are using the KCL it is managing a DynamoDB table for that application with a record for each shard with the "current" sequence number for that shard.
AT_TIMESTAMP - the estimate time of the event put into the stream. You will use this option if you want to find specific events to process based on their timestamp. For example, when you know that you had a real life event in your service at a specific time, you can develop an application that will process these specific events, even if you don't have the sequence number.
See more details in Kinesis documentation here: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html
You should use the "TRIM_HORIZON". It will only have effect on the first time your application starts to read records from the stream.
After that, it will continue from last known position.
How do I tell what percentage of the data in a Kinesis stream a reader has already processed? I know each reader has a per-shard checkpoint sequence number, and I can also get the StartingSequenceNumber of each shard from describe-stream, however, I don't know how far along in my data the reader currently is (I don't know the latest sequence number of the shard).
I was thinking of getting a LATEST iterator for each shard and getting the last record's sequence number, however that doesn't seem to work if there's no new data since I got the LATEST iterator.
Any ideas or tools for doing this out there?
Thanks!
I suggest you implement a custom metric or metrics in your applications to track this.
For example, you could append a message send time within your Kinesis message, and on processing the message, record the time difference as an AWS CloudWatch custom metric. This would indicate how close your consumer is to the front of the stream.
You could also record the number of messages pushed (at the pushing application) and messages received at the Kinesis consumer. If you compare these in a chart on CloudWatch, you could see that the curves roughly follow each other indicating that the consumer is doing a good job at keeping up with the workload.
You could also try monitoring your Kinesis consumer, to see how often it idly waits for records (i.e, no results are returned by Kinesis, suggesting it is at the front of the stream and all records are processed)
Also note there is not a way to track a "percent" processed in the stream, since Kinesis messages expire after 24 hours (so the total number of messages is constantly rolling). There is also not a direct (API) function to count the number of messages inside your stream (unless you have recorded this as above).
If you use KCL you can do that by comparing IncomingRecords from the cloudwatch built-in metrics of Kinesis with RecordsProcessed which is a custom metric published by the KCL.
Then you select a time range and interval of say 1 day.
You would then get the following type of graphs:
As you can see there were much more records added than processed. By looking at the values in each point you will know exactly if your processor is behind or not.