What are shards and partition keys in a Kinesis data stream? I read the AWS documents but I don't get it. Can someone explain them in simple terms?
From Amazon Kinesis Data Streams Terminology and Concepts - Amazon Kinesis Data Streams:
A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards.
So, a shard has two purposes:
A certain amount of capacity/throughput
An ordered list of messages
If your application must process all messages in order, then you can only use one shard. Think of it as a line at a bank — if there is one line, then everybody gets served in order.
However, if messages only need to be ordered within certain subsets, those subsets can be sent to separate shards. For example, multiple lines in a bank, where each line gets served in order. Or, think of buses sending GPS coordinates. Each bus sends messages to only a single shard. A shard might contain messages from multiple buses, but each bus only sends to one shard. This way, when the messages from that shard are processed, all messages from a particular bus are processed in order.
This is controlled by using a Partition Key, which identifies the source. The partition key is hashed and assigned to a shard. Thus, all messages with the same partition key will go to the same shard.
At the back-end, there is typically one worker per shard processing the messages, in order, from that shard.
If your system does not care about preserving message order, then use a random partition key. This means each message can be sent to any shard.
Related
AWS Kinesis has a fairly low write throughput of 1,000 records/sec and 1 MB/sec per shard. How does Kinesis enforce this limit? If I were to try to do 1,500 writes in a second, would the extra 500 writes be placed into some sort of queue or would they simply fail?
It looks like it simply fails and throws an exception.
An unsuccessfully processed record includes ErrorCode and ErrorMessage values. ErrorCode reflects the type of error and can be one of the following values: ProvisionedThroughputExceededException or InternalFailure. ErrorMessage provides more detailed information about the ProvisionedThroughputExceededException exception including the account ID, stream name, and shard ID of the record that was throttled. For more information about partially successful responses, see Adding Multiple Records with PutRecords in the Amazon Kinesis Data Streams Developer Guide.
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
How the rate limiting is done
Rate Limiting
The KPL includes a rate limiting feature, which limits per-shard throughput sent from a single producer. Rate limiting is implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes. Each successful write to a Kinesis data stream adds a token (or multiple tokens) to each bucket, up to a certain threshold. This threshold is configurable but by default is set 50% higher than the actual shard limit, to allow shard saturation from a single producer.
You can lower this limit to reduce spamming due to excessive retries. However, the best practice is for each producer to retry aggressively for maximum throughput, and to handle any resulting throttling deemed excessive by expanding the capacity of the stream and implementing an appropriate partition key strategy.
https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html
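To make the algorithm concrete, here is a toy token bucket in Python. This is the classic refill-over-time formulation, not the KPL's actual implementation, and the numbers (1,000 records/sec with 50% headroom) are just the defaults quoted above:

```python
import time

class TokenBucket:
    """Toy token bucket: holds at most `capacity` tokens, refilled
    continuously at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, n: float = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # out of tokens: caller should back off and retry

# One bucket per shard for records: 1,000 records/sec limit with the
# KPL-style 50% headroom (a second bucket would track bytes).
records_bucket = TokenBucket(rate=1000, capacity=1500)
```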
This depends on the way that you're writing the data.
If you're using PutRecord then any request that exceeds the limit will fail with ProvisionedThroughputExceededException and you'll have to retry the request. However, since round-trip times for a single request are on the order of 20-30 ms, you'll need to have a large number of clients to get throttled.
The PutRecords call has a much higher likelihood of being throttled, because you can send up to 500 records in a single request. And if it's throttled, the throttling may affect the entire request or individual records within the request (this could happen if one shard accepts records but another doesn't).
To deal with this, you need to examine the Records list from the PutRecords response. This array corresponds exactly with the Records list from the request, but contains PutRecordsResultEntry values.
If an entry has a SequenceNumber then you're OK: that record was written to a shard. If, however, it has an ErrorCode then you need to copy the record from the request and re-send it (assuming that the error code is throughput exceeded; you could also try resending if it's internal error, but that may not work).
You will need to loop, calling PutRecords until the response doesn't have any unsent messages.
Beware that, due to the possibility of individual records being throttled and resent, you can't guarantee the order that records will appear on a shard (they are stored in the shard in the order that they were received).
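The resend loop described above can be sketched as follows (`client` would be a boto3 Kinesis client; the backoff timings and attempt limit are arbitrary choices for the example):

```python
import time

def put_records_with_retry(client, stream_name, records, max_attempts=5):
    """Send a batch with PutRecords, resending only the entries that
    came back with an ErrorCode.

    `records` is a list of {"Data": bytes, "PartitionKey": str} dicts,
    exactly as the PutRecords API expects.
    """
    pending = list(records)
    for attempt in range(max_attempts):
        resp = client.put_records(StreamName=stream_name, Records=pending)
        if resp.get("FailedRecordCount", 0) == 0:
            return
        # Response entries line up index-for-index with the request, so
        # keep only the records whose result entry carries an ErrorCode.
        pending = [rec for rec, result in zip(pending, resp["Records"])
                   if "ErrorCode" in result]
        time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"{len(pending)} records still unsent after {max_attempts} attempts")
```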
I know that the maximum number of ConsumerGroups we can have in an eventhub is 20, and the maximum number of partitions is 32. And with EventProcessorHost, there is only one active reader per ConsumerGroup per partition. So I wanted to know what is the maximum number of consumers reading simultaneously from an eventhub is possible.
It is recommended to have a maximum of one consumer (belonging to one consumer group) processing events from one partition at one time. However, the Event Hub service supports a maximum of 5 consumers per consumer group concurrently receiving events from one partition. But obviously, since they are subscribed to the same partition and belong to the same consumer group, they would be reading the same data unless each consumer maintains and reads from a different offset.
You can refer to this article from Azure docs to confirm this.
Also this blog presents a nice code snippet to test out the same support of up-to 5 concurrent consumers per partition.
So for your figures, I think, theoretically, that would make 20 (consumer groups) × 5 (consumers per group) × 32 (partitions) = 3,200 active consumers running concurrently.
We currently have an application that receives a large amount of sensor data. Each sensor has its own unique sensor id (e.g. '5834f7718273f92cc326f620') and emits its status at different intervals. The processing order of the messages that come in is not important; for example, a newer message from one sensor can be processed before an older message from another sensor. What does matter, though, is that each message for a given sensor must be processed sequentially, in the order that it arrived in the stream.
I have taken a look at the Kinesis Client Library and understand that KCL pushes messages to a single processor per shard. Does this mean that if a stream has only one shard it will have only one processor and couldn't this create a bottleneck? Or does KCL have more than one processor, and somehow, perhaps using the partition key ensures messages with the same partition key are never processed concurrently?
Note: We have taken a look at SQS FIFO, but ruled it out as the 300 messages per second limit would soon become an issue.
Yes, each shard can only have one processor at a given moment (per application).
But you can use the sensor id as the partition key for your Kinesis PutRecord request.
This will make sure that all of that sensor's events get into the same shard and processor.
If you do that, you'll be able to scale your processors and shards and still have each sensor's events processed by a single processor, in order.
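To see why this works, note that Kinesis MD5-hashes the partition key into a 128-bit keyspace, and each shard owns one contiguous range of that keyspace. A small self-contained sketch (the shard count and the evenly split ranges are assumptions for the illustration):

```python
import hashlib

NUM_SHARDS = 4  # assumed; a real stream exposes each shard's hash-key range

def shard_for(partition_key: str) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5-hash the key into a 128-bit integer, then find which shard's
    range it falls into (here: equal-sized ranges)."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return h * NUM_SHARDS // 2 ** 128

# The same sensor id always maps to the same shard...
assert shard_for("5834f7718273f92cc326f620") == shard_for("5834f7718273f92cc326f620")
# ...so that sensor's events are always processed by the same worker.
```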
I am trying to make a Kinesis consumer client. To work on it I went through the Kinesis Developer Guide and this AWS document: http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-implementation-app-java.html.
I was wondering: is it possible to get data from two different streams and process it accordingly?
Say I have two different streams, stream1 and stream2.
Is it possible to get data from both streams and process each individually?
Why not? Do get_records from both streams.
If your streams have only a single shard each, a single worker will see all the events (it is recommended to process each shard with a single worker). So if your logic needs to join events from different sources/streams, you can implement it with a single worker reading from both streams.
Note that if you have streams with multiple shards, each one of your workers will see only a part of the events. You can have the following options:
Both streams have a single shard each - in this case you can read with a single worker from both streams and see all events from both streams. You can add timestamps or other keys to allow you to "join" these events in the worker.
One stream (stream1) with one shard and the second stream (stream2) with multiple shards - in this case you can read from stream1 in all your workers, each of which also processes a single shard of stream2. Each of your workers will see all the events of stream1 and its share of the events of stream2. Note that you have a limit on the speed at which you can read events from stream1's single shard (2 MB/second or 5 reads/second), and if you have many shards in stream2, this can be a real limit.
Both streams have multiple shards - in this case it will be more complex to ensure that you are able to "join" these events, as you need to sync both the writes and the reads to these streams. You could read from all shards of both streams with a single worker, but this is not good practice as it limits your ability to scale, since you no longer have a distributed system. Another option is to use the same partition_key in both streams, with the same number of shards and the same partition definition for both, and verify that you are reading from the "right" shard of each stream in each of your workers, and that you do this correctly every time one of your workers fails and restarts, which might be a bit complex.
Another option that you can consider is to write both types of events in a single stream, again using the same partition_key, and then filter them on the reader side if you need to process them differently (for example, to write them to different log files in S3).
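For the single-shard-per-stream case, a single worker reading both streams might look like this sketch (the `kinesis` client is passed in by the caller; the function names and batch limit are made up for the example):

```python
def shard_iterator(kinesis, stream_name):
    """Iterator for the single shard of a one-shard stream, starting
    from the oldest available record."""
    shard_id = kinesis.describe_stream(StreamName=stream_name)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    return kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

def poll_both(kinesis, stream1, stream2, handle, batches=10):
    """One worker draining two single-shard streams in a single loop.
    `kinesis` would be boto3.client("kinesis") in real use."""
    iters = {s: shard_iterator(kinesis, s) for s in (stream1, stream2)}
    for _ in range(batches):  # a real consumer would loop forever
        for name in (stream1, stream2):
            resp = kinesis.get_records(ShardIterator=iters[name], Limit=100)
            for rec in resp["Records"]:
                handle(name, rec)  # join/merge events from both streams here
            iters[name] = resp["NextShardIterator"]
```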
How do I tell what percentage of the data in a Kinesis stream a reader has already processed? I know each reader has a per-shard checkpoint sequence number, and I can also get the StartingSequenceNumber of each shard from describe-stream, however, I don't know how far along in my data the reader currently is (I don't know the latest sequence number of the shard).
I was thinking of getting a LATEST iterator for each shard and getting the last record's sequence number, however that doesn't seem to work if there's no new data since I got the LATEST iterator.
Any ideas or tools for doing this out there?
Thanks!
I suggest you implement a custom metric or metrics in your applications to track this.
For example, you could include a send time within each Kinesis message, and on processing the message, record the time difference as an AWS CloudWatch custom metric. This would indicate how close your consumer is to the front of the stream.
You could also record the number of messages pushed (at the pushing application) and messages received at the Kinesis consumer. If you compare these in a chart on CloudWatch, you could see that the curves roughly follow each other indicating that the consumer is doing a good job at keeping up with the workload.
You could also try monitoring your Kinesis consumer to see how often it idly waits for records (i.e., no records are returned by Kinesis, suggesting it is at the front of the stream and all records have been processed).
Also note there is not a way to track a "percent" processed in the stream, since Kinesis messages expire after 24 hours (so the total number of messages is constantly rolling). There is also not a direct (API) function to count the number of messages inside your stream (unless you have recorded this as above).
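As a sketch of the first suggestion, assuming the producer embeds a Unix timestamp in each message (the namespace and metric name below are made up; the CloudWatch client is passed in so it can be stubbed out):

```python
import time

def report_consumer_lag(cloudwatch, sent_at_epoch_seconds: float) -> float:
    """Publish how far behind the consumer is as a custom metric.
    `cloudwatch` would be boto3.client("cloudwatch") in real use;
    `sent_at_epoch_seconds` is the timestamp the producer embedded
    in the message body."""
    lag = time.time() - sent_at_epoch_seconds
    cloudwatch.put_metric_data(
        Namespace="MyApp/Kinesis",             # illustrative namespace
        MetricData=[{
            "MetricName": "ConsumerLagSeconds",
            "Value": lag,
            "Unit": "Seconds",
        }],
    )
    return lag
```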
If you use the KCL you can do that by comparing IncomingRecords, from the built-in CloudWatch metrics of Kinesis, with RecordsProcessed, which is a custom metric published by the KCL.
Then you select a time range and an interval of, say, 1 day.
You would then get a graph of both metrics over time. In my case, many more records were added than processed. By looking at the values at each point you will know exactly whether your processor is behind or not.
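A rough way to pull that comparison programmatically (the KCL publishes its metrics under the application's name as the namespace; the dimension names here are assumptions, so check what your KCL app actually emits):

```python
from datetime import datetime, timedelta, timezone

def daily_backlog(cloudwatch, stream_name, app_name):
    """Sum IncomingRecords (built-in) and RecordsProcessed (KCL custom
    metric) over the last day and return the difference.
    `cloudwatch` would be boto3.client("cloudwatch") in real use."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=1)

    def total(namespace, metric, dimensions):
        resp = cloudwatch.get_metric_statistics(
            Namespace=namespace, MetricName=metric, Dimensions=dimensions,
            StartTime=start, EndTime=end, Period=3600, Statistics=["Sum"])
        return sum(point["Sum"] for point in resp["Datapoints"])

    incoming = total("AWS/Kinesis", "IncomingRecords",
                     [{"Name": "StreamName", "Value": stream_name}])
    processed = total(app_name, "RecordsProcessed",    # dimensions assumed
                      [{"Name": "Operation", "Value": "ProcessTask"}])
    return incoming - processed  # > 0 means the consumer is falling behind
```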