AWS Kinesis has a fairly low write throughput of 1,000 writes/sec and 1 MB/sec per shard. How does Kinesis enforce this limit? If I were to try to do 1,500 writes in a second, would the extra 500 writes be placed into some sort of queue, or would they simply fail?
It looks like it simply fails and throws an exception.
An unsuccessfully processed record includes ErrorCode and ErrorMessage values. ErrorCode reflects the type of error and can be one of the following values: ProvisionedThroughputExceededException or InternalFailure. ErrorMessage provides more detailed information about the ProvisionedThroughputExceededException exception including the account ID, stream name, and shard ID of the record that was throttled. For more information about partially successful responses, see Adding Multiple Records with PutRecords in the Amazon Kinesis Data Streams Developer Guide.
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
How the rate limiting is done
Rate Limiting
The KPL includes a rate limiting feature, which limits per-shard throughput sent from a single producer. Rate limiting is implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes. Each successful write to a Kinesis data stream adds a token (or multiple tokens) to each bucket, up to a certain threshold. This threshold is configurable but by default is set 50% higher than the actual shard limit, to allow shard saturation from a single producer.
You can lower this limit to reduce spamming due to excessive retries. However, the best practice is for each producer to retry aggressively for maximum throughput, and to handle any resulting throttling determined as excessive by expanding the capacity of the stream and implementing an appropriate partition key strategy.
https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html
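To make the idea concrete, here is a minimal sketch of a textbook token bucket with separate buckets for records and bytes. It is not the KPL's or the Kinesis service's actual implementation; the class names, the refill-over-time formulation, and the default limits (1,000 records/sec and roughly 1 MB/sec for one shard) are assumptions for illustration only.

```python
import time


class TokenBucket:
    """Textbook token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def _refill(self, now):
        # Add tokens for the time that has passed, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def available(self, amount, now):
        self._refill(now)
        return self.tokens >= amount

    def take(self, amount):
        self.tokens -= amount


class ShardLimiter:
    """Admits a write only if BOTH the record bucket and the byte bucket allow it."""

    def __init__(self, records_per_sec=1000, bytes_per_sec=1_000_000, now=None):
        # Illustrative defaults, assumed to approximate one shard's write limits.
        self.records = TokenBucket(records_per_sec, records_per_sec, now)
        self.bytes = TokenBucket(bytes_per_sec, bytes_per_sec, now)

    def try_write(self, size_bytes, now=None):
        now = time.monotonic() if now is None else now
        if self.records.available(1, now) and self.bytes.available(size_bytes, now):
            self.records.take(1)
            self.bytes.take(size_bytes)
            return True
        return False  # rejected immediately; nothing is queued
```

With this sketch, 1,500 small writes attempted in the same instant would see 1,000 accepted and 500 rejected, which mirrors the "fails rather than queues" behaviour described above.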
This depends on the way that you're writing the data.
If you're using PutRecord then any request that exceeds the limit will fail with ProvisionedThroughputExceededException and you'll have to retry the request. However, since round-trip times for a single request are on the order of 20-30 ms, you'll need to have a large number of clients to get throttled.
The PutRecords call has a much higher likelihood of being throttled, because you can send up to 500 records in a single request. And if it's throttled, the throttling may affect the entire request or individual records within the request (this could happen if one shard accepts records but another doesn't).
To deal with this, you need to examine the Records list from the PutRecords response. This array corresponds exactly with the Records list from the request, but contains PutRecordsResultEntry values.
If an entry has a SequenceNumber then you're OK: that record was written to a shard. If, however, it has an ErrorCode then you need to copy the record from the request and re-send it (assuming that the error code is throughput exceeded; you could also try resending if it's internal error, but that may not work).
You will need to loop, calling PutRecords until the response doesn't have any unsent messages.
Beware that, due to the possibility of individual records being throttled and resent, you can't guarantee the order that records will appear on a shard (they are stored in the shard in the order that they were received).
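The steps above can be sketched roughly as follows. This is only an illustration: `records_to_retry` and `put_all` are hypothetical helper names, and `put_records` is assumed to be any callable that takes a list of records and returns a response shaped like the PutRecords response (with boto3, a thin wrapper around `client.put_records`).

```python
def records_to_retry(request_records, response):
    """Return the request records whose matching result entry has an ErrorCode."""
    results = response["Records"]
    if len(results) != len(request_records):
        raise ValueError("response does not line up with request")
    # Result entries correspond to request entries by position.
    return [
        record
        for record, result in zip(request_records, results)
        if "ErrorCode" in result
    ]


def put_all(put_records, records, max_attempts=5):
    """Call put_records until nothing is left to send or attempts run out.

    Returns the records that were never accepted (empty list on success).
    """
    pending = list(records)
    for _ in range(max_attempts):
        if not pending:
            break
        pending = records_to_retry(pending, put_records(pending))
    return pending
```

Real code would normally also wait between attempts; the delay is left out here to keep the index-matching logic visible.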
Related
I’m using Lambda to load data records into Kinesis and often want to add up to 500K records. I am batching these into chunks of 500 and using Boto's put_records method to send them to Kinesis. I sometimes see failures due to exceeding the allowed throughput.
What is the best approach for retrying when this happens? Ideally I don’t want duplicate messages in the data stream, so I don’t want to simply resend all 500 records, but I’m struggling to see how to retry only the failed messages. The response from the put_records method doesn’t seem to be very useful.
Can I rely on the order of the response Records list being in the same order as the list I pass to putRecords?
I know I can increase the number of shards, but I’d like to significantly increase the number of parallel Lambda functions loading data to this Kinesis stream. We plan to partition data based on the source system and I can’t guarantee that multiple functions won’t write data to the same shard and exceed the allowed throughput. As a result, I don't believe that increasing shards will remove the need for a retry strategy.
Alternatively, does anybody know if KPL will automatically handle this issue for me?
Can I rely on the order of the response Records list being in the same order as the list I pass to putRecords?
Yes, you can rely on the order of the response: the response records are in the same order as the request records.
Please check the PutRecords response documentation: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html.
Records:
An array of successfully and unsuccessfully processed record results, correlated with the request by natural ordering. A record that is successfully added to a stream includes SequenceNumber and ShardId in the result. A record that fails to be added to a stream includes ErrorCode and ErrorMessage in the result.
To retry the failed records, you have to develop your own retry mechanism. I have written one in Python, using a recursive function with an incremental wait between retries, as follows.
import time

import boto3

kinesis_client = boto3.client('kinesis')
KINESIS_RETRY_COUNT = 10
KINESIS_RETRY_WAIT_IN_SEC = 0.1
KINESIS_STREAM_NAME = "your-kinesis-stream"


def send_to_stream(kinesis_records, retry_count):
    put_response = kinesis_client.put_records(
        Records=kinesis_records,
        StreamName=KINESIS_STREAM_NAME
    )
    failed_count = put_response['FailedRecordCount']
    if failed_count > 0:
        if retry_count > 0:
            # Response entries line up with request entries by index,
            # so collect only the records whose result has an ErrorCode.
            retry_kinesis_records = []
            for idx, record in enumerate(put_response['Records']):
                if 'ErrorCode' in record:
                    retry_kinesis_records.append(kinesis_records[idx])
            # Wait a little longer on each successive retry.
            time.sleep(KINESIS_RETRY_WAIT_IN_SEC * (KINESIS_RETRY_COUNT - retry_count + 1))
            send_to_stream(retry_kinesis_records, retry_count - 1)
        else:
            print(f'Not able to put records after retries. Records = {put_response["Records"]}')
In the above example, you can change KINESIS_RETRY_COUNT and KINESIS_RETRY_WAIT_IN_SEC to suit your needs. You also have to ensure that your Lambda timeout is long enough to cover the retries.
Alternatively, does anybody know if KPL will automatically handle this issue for me?
I am not sure about KPL, but from the documentation it looks like it has its own retry mechanism. https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html
While you should definitely handle the failures and resend them, one way to minimise the number of extra records to resend is to simply send 500 records, and if you have more to send, delay for 500ms before sending the next lot.
Waiting for 500ms every 500 records will limit you to 1,000 records/sec, which is the Kinesis write limit for a single shard. Staying under this limit will minimise the number of records that have to be sent multiple times.
Only processing 500 records at a time from a larger list also could make the retry logic easier, because any records that fail can simply be appended onto the end of the master list, where they'll be retried when the loop checks to see if there are any more records in the master list left to send to Kinesis.
Just remember to put a check in to abort if the master list isn't getting any smaller on each attempt to send 500 records, which will happen if there is at least one record that is failing every time. Eventually it will be the last one in the list and will keep being sent over and over forever unless this check is in place.
Note that this applies to one shard; if you have more shards, you can adjust these limits accordingly.
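A rough sketch of the pacing approach described above, with failed records appended to the end of the master list and an abort check for batches that make no progress. The function name and parameters are hypothetical, and `send_batch` is assumed to be a callable that sends a list of records and returns the sub-list that failed (for example, built from the PutRecords response).

```python
import time
from collections import deque


def send_paced(records, send_batch, batch_size=500, pause_sec=0.5,
               max_stalled_rounds=3, sleep=time.sleep):
    """Send records in paced batches, re-queueing failures at the end.

    Returns whatever could not be sent (empty list on full success).
    """
    queue = deque(records)
    stalled = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        failed = send_batch(batch)
        queue.extend(failed)  # failures go to the back of the master list
        # A round where every record failed means the list did not shrink.
        stalled = stalled + 1 if len(failed) == len(batch) else 0
        if stalled >= max_stalled_rounds:
            break  # give up instead of resending the same records forever
        if queue:
            sleep(pause_sec)  # 500 records per 0.5 s is about 1,000 records/sec
    return list(queue)
```

The `sleep` argument is injectable only so the pacing can be tested without waiting; in normal use the default `time.sleep` applies.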
1 RCU is one read request per second for an item of up to 4 KB with strong consistency, or two read requests per second (2 x 4 KB = 8 KB/sec) with eventual consistency.
If an application gets 10 strongly consistent read requests per second and the RCU is 1, what happens? Can DynamoDB respond to only 1 request per second? And when the RCU is 10, can DynamoDB respond to 10 requests per second?
What will happen to my application if I have tens of thousands of requests to a table per second?
Your requests will be throttled.
See here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ProvisionedThroughput.html
If your read or write requests exceed the throughput settings for a table,
DynamoDB can throttle that request. DynamoDB can also throttle read requests
that exceed the throughput settings for an index.
Throttling prevents your application from consuming too many capacity units.
When a request is throttled, it fails with an HTTP 400 code (Bad Request) and
a ProvisionedThroughputExceededException.
The AWS SDKs have built-in support for retrying throttled requests (see Error
Retries and Exponential Backoff), so you do not need to write this logic
yourself.
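To put numbers on the question, here is a small worked calculation of how many read capacity units a workload consumes. It is a simplified sketch (the function name is made up, and it ignores burst capacity and transactional reads): each read costs one unit per 4 KB of item size, rounded up, and an eventually consistent read costs half as much.

```python
import math

RCU_ITEM_BYTES = 4 * 1024  # one RCU covers a strongly consistent read of up to 4 KB


def rcus_needed(item_size_bytes, reads_per_sec, strongly_consistent=True):
    """Estimate the read capacity units consumed per second."""
    units_per_read = math.ceil(item_size_bytes / RCU_ITEM_BYTES)
    if not strongly_consistent:
        # An eventually consistent read costs half as much.
        return math.ceil(units_per_read * reads_per_sec / 2)
    return units_per_read * reads_per_sec
```

So 10 strongly consistent reads per second of items up to 4 KB need `rcus_needed(4096, 10)` units, i.e. 10; with only 1 RCU provisioned, requests beyond what the table can serve are the ones that get throttled.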
If I understand correctly, a CloudWatch Agent publishes events to CloudWatch using a kind of batching, the size of which is specified by two params:
batch_count:
Specifies the max number of log events in a batch, up to 10000. The
default value is 1000.
batch_size
Specifies the max size of log events in a batch, in bytes, up to
1048576 bytes. The default value is 32768 bytes. This size is
calculated as the sum of all event messages in UTF-8, plus 26 bytes
for each log event.
I guess that, in order to eliminate the possibility of losing any log data if an EC2 instance is terminated, the batch_count should be equal to 1 (because when the instance is terminated, all logs will be destroyed). Am I right that this is the only way to achieve it, and how can this affect performance? Will it have any noticeable side effects?
Yes, it's a bad idea. You are probably more likely to lose data that way. The PutLogEvents API that the agent uses is limited to 5 requests per second per log stream (source). With a batch_count of 1, you'd only be able to publish 5 log events per second. If the application were to produce more than that consistently, the agent wouldn't be able to keep up.
If you absolutely can't afford to lose any log data, maybe you should be writing that data to a database instead. There will always be some risk of losing log data, even with a batch_count of 1. The host could always crash before the agent polls the log file... which BTW is every 5 seconds by default (source).
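The arithmetic behind this can be sketched in a few lines. The constants are taken from the figures quoted in this thread (26 bytes of overhead per event, 5 PutLogEvents requests per second per log stream); the function names are made up for illustration.

```python
EVENT_OVERHEAD_BYTES = 26   # per-event overhead quoted in the question
REQUESTS_PER_SEC = 5        # PutLogEvents limit per log stream cited above


def batch_bytes(messages):
    """Size of a batch: UTF-8 bytes of every message plus 26 bytes per event."""
    return sum(len(m.encode("utf-8")) + EVENT_OVERHEAD_BYTES for m in messages)


def max_events_per_sec(batch_count):
    """Upper bound on events per second that one log stream can publish."""
    return REQUESTS_PER_SEC * batch_count
```

For example, `max_events_per_sec(1)` gives 5, while the default batch_count of 1000 allows up to 5,000 events per second before the request limit becomes the bottleneck.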
We are experiencing a ProvisionedThroughputExceededException upon checkpointing many events together.
The exception stacktrace is the following:
com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId-000000000000 in stream mystream under account accountid. (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: ea36760b-9db3-0acc-bbe9-87939e3270aa)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1529)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1167)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:948)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:618)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:573)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:445)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:1645)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:1621)
at com.amazonaws.services.kinesis.AmazonKinesisClient.getShardIterator(AmazonKinesisClient.java:909)
at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.getIterator(KinesisProxy.java:291)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.SequenceNumberValidator.validateSequenceNumber(SequenceNumberValidator.java:79)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:90)
As you can see here, the exception is raised at
RecordProcessorCheckpointer.java:90
inside the KCL library. What does checkpointing have to do with exceeding the throughput?
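Kinesis is rate-limited: PutRecord requests can only be processed up to the provisioned throughput limit of the shard involved, and exceeding that limit throws a ProvisionedThroughputExceededException. Note that in your stack trace the throttled call is getShardIterator, which the KCL issues while validating the sequence number during checkpoint, so calls other than writes are evidently rate-limited per shard as well.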
The obvious solution would be to split the stream's shard in two and divide the hash key space evenly. That might look unnecessary if your metrics are within the limits of a single shard, but if you use up your limit of 1,000 writes/sec in the first 500 ms, your activity on that shard will be throttled for the remaining half of the second, so there is no way to avoid throttling with a single shard.
You can configure automatic retries after short delays for your throttled requests. Check your SDK's documentation for examples of this.
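If you end up writing the retry yourself, a minimal sketch of retrying after short, growing delays might look like this. `ThrottledError` is a hypothetical stand-in for whatever throttling exception your SDK raises, and the delay values are arbitrary assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from your SDK."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying on ThrottledError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Jitter: wait a random fraction of an exponentially growing cap.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

The random jitter is there so that many throttled clients do not all retry at the same instant and get throttled again together.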
How do I tell what percentage of the data in a Kinesis stream a reader has already processed? I know each reader has a per-shard checkpoint sequence number, and I can also get the StartingSequenceNumber of each shard from describe-stream, however, I don't know how far along in my data the reader currently is (I don't know the latest sequence number of the shard).
I was thinking of getting a LATEST iterator for each shard and getting the last record's sequence number, however that doesn't seem to work if there's no new data since I got the LATEST iterator.
Any ideas or tools for doing this out there?
Thanks!
I suggest you implement a custom metric or metrics in your applications to track this.
For example, you could append a message send time within your Kinesis message, and on processing the message, record the time difference as an AWS CloudWatch custom metric. This would indicate how close your consumer is to the front of the stream.
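As a rough sketch of that idea, the producer could wrap each payload with its send time and the consumer could compute the difference when it processes the record. The JSON envelope and the `sent_at` field name are assumptions, not anything Kinesis defines, and the result is only meaningful if producer and consumer clocks are reasonably in sync.

```python
import json
import time


def wrap_with_send_time(payload, now=None):
    """Producer side: embed the send time (epoch seconds) in the message."""
    sent_at = time.time() if now is None else now
    return json.dumps({"sent_at": sent_at, "payload": payload}).encode("utf-8")


def consumer_lag_ms(data, now=None):
    """Consumer side: milliseconds between sending and processing a record."""
    received_at = time.time() if now is None else now
    message = json.loads(data.decode("utf-8"))
    # Clamp at zero in case of small clock skew between hosts.
    return max(0.0, (received_at - message["sent_at"]) * 1000.0)
```

The value returned by `consumer_lag_ms` is what you would then publish as the custom CloudWatch metric, under a metric name and namespace of your choosing.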
You could also record the number of messages pushed (at the pushing application) and messages received at the Kinesis consumer. If you compare these in a chart on CloudWatch, you could see that the curves roughly follow each other indicating that the consumer is doing a good job at keeping up with the workload.
You could also try monitoring your Kinesis consumer to see how often it idly waits for records (i.e., no results are returned by Kinesis, suggesting it is at the front of the stream and all records are processed).
Also note there is no way to track a "percent" processed in the stream, since Kinesis messages expire after 24 hours by default (so the total number of messages is constantly rolling). There is also no direct (API) function to count the number of messages inside your stream (unless you have recorded this as above).
If you use KCL, you can do that by comparing IncomingRecords, from the built-in CloudWatch metrics of Kinesis, with RecordsProcessed, which is a custom metric published by the KCL.
Then you select a time range and interval of say 1 day.
You would then get the following type of graphs:
As you can see, there were many more records added than processed. By looking at the values at each point you will know exactly whether your processor is behind or not.