What happens when multiple read or write requests occur at the same time (same second) in DynamoDB?

1 RCU is 1 request per second: 4 KB/sec per request for a strongly consistent read, and (4 x 2) 8 KB/sec for an eventually consistent read.
If an application gets 10 strongly consistent read requests per second and the table has 1 RCU, what happens in this scenario? Can DynamoDB only respond to 1 request per second? What happens when the RCU is 10? Can DynamoDB respond to 10 requests per second?
What will happen to my application if I have tens of thousands of requests to a table per second?

Your requests will be throttled.
See here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ProvisionedThroughput.html
If your read or write requests exceed the throughput settings for a table,
DynamoDB can throttle that request. DynamoDB can also throttle read requests
that exceed the throughput settings for an index.
Throttling prevents your application from consuming too many capacity units.
When a request is throttled, it fails with an HTTP 400 code (Bad Request) and
a ProvisionedThroughputExceededException.
The AWS SDKs have built-in support for retrying throttled requests (see Error
Retries and Exponential Backoff), so you do not need to write this logic
yourself.
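For illustration only (not taken from the linked page), a minimal boto3 sketch of leaning on that built-in retry support might look like this; the table name, key, and retry budget are made-up placeholders:
import boto3
from botocore.config import Config

# The SDK retries throttled requests (HTTP 400 / ProvisionedThroughputExceededException)
# for you; the retry budget is adjustable through botocore's Config.
retry_config = Config(retries={"max_attempts": 10, "mode": "standard"})
dynamodb = boto3.client("dynamodb", config=retry_config)

# Hypothetical table and key; a strongly consistent read consumes 1 RCU per 4 KB.
response = dynamodb.get_item(
    TableName="my-table",
    Key={"pk": {"S": "user#123"}},
    ConsistentRead=True,
)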

Related

Getting Cloud Run "Rate exceeded" error when just two requests are being processed

Cloud Run is configured with the default concurrency of 80, so when I was testing two simultaneous connections, how could the error "Rate exceeded" be thrown?
What happens if the number of requests exceeds the concurrency? Suppose concurrency is set to two; if the third, fourth, and fifth requests come in while the first and second requests have not finished, do these requests wait up to the request timeout, or are they not served at all?

How to implement resiliency (retry) in a nested service call chain

We have a webpage that queries an item from an API gateway which in turn calls a service that calls another service and so on.
Webpage --> API Gateway --> service#1 --> service#2 --> data store (RDMS, S3, Azure blob)
We want to make the operation resilient so we added a retry mechanism at every layer.
Webpage --retry--> API Gateway --retry--> service#1 --retry--> service#2 --retry--> data store.
This, however, could cause a cascading failure: if the data store doesn't respond in time, it will cause every layer to time out and retry. In other words, if each layer has the same connection timeout and is configured to retry 3 times, there can be a total of 81 retries to the data store (which is called a retry storm).
One way to fix this is to increase the timeout at each layer in order to give the layer below time to retry.
Webpage --5m timeout--> API Gateway --2m timeout--> service#1
This however is unacceptable because the timeout at the webpage will be too long.
How should I address this problem?
Should there only be one layer that retries? Which layer? And how can the layer know if the error is transient?
A couple of possible solutions (and you can/should use both) would be to retry only on specific conditions and to implement rate limiters/circuit breakers.
Retry On is a technique where you don't retry on every condition, but only on specific conditions. This could be a specific error code or a specific header value. E.g. in your current situation, DO NOT retry on timeouts; only retry on server failures. In addition, you could have each layer retry on different conditions.
Rate limiting would be to stick either a local or global rate limiter service inline with the connections. This would just help to short-circuit the thundering herd in the case that it starts up. E.g. rate limit the data layer to X req/s (insert real values here) and the gateway to Y req/s, and then even if a service attempts lots of retries it won't pass too far down the chain. Similar to this is circuit breaking, where each layer only permits X active connections to any downstream, which is just another way to slow those retry storms.
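As a rough sketch of the "retry on" idea (the status codes, attempt count, and delay below are made-up placeholders, not taken from the question's stack):
import time

RETRYABLE_STATUS = {500, 502, 503}   # retry only on server failures, never on timeouts
MAX_ATTEMPTS = 3
BASE_DELAY_SEC = 0.2

def call_with_retry_on(make_request):
    # make_request() is any callable returning (status_code, body).
    for attempt in range(MAX_ATTEMPTS):
        status, body = make_request()
        if status not in RETRYABLE_STATUS:
            return status, body            # success, timeout, or other non-retryable error
        time.sleep(BASE_DELAY_SEC * (2 ** attempt))   # back off before the next attempt
    return status, body                    # retry budget exhausted; surface the last result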

Loading multiple records to Kinesis using PutRecords - how to re-send only failed records in case of failure?

I’m using Lambda to load data records into Kinesis and often want to add up to 500K records. I am batching these into chunks of 500 and using Boto's put_records method to send them to Kinesis. I sometimes see failures due to exceeding the allowed throughput.
What is the best approach for retrying when this happens? Ideally I don’t want duplicate messages in the data stream, so I don’t want to simply resend all 500 records, but I’m struggling to see how to retry only the failed messages. The response from the put_records method doesn’t seem to be very useful.
Can I rely on the order of the response Records list being in the same order as the list I pass to putRecords?
I know I can increase the number of shards, but I’d like to significantly increase the number of parallel Lambda functions loading data to this Kinesis stream. We plan to partition data based on the source system, and I can’t guarantee that multiple functions won’t write data to the same shard and exceed the allowed throughput. As a result, I don't believe that increasing shards will remove the need for a retry strategy.
Alternatively, does anybody know if KPL will automatically handle this issue for me?
Can I rely on the order of the response Records list being in the same order as the list I pass to putRecords?
Yes, you can rely on the order of the response; the order of the response records is the same as the order of the request records.
Please check the PutRecords response documentation: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html.
Records:
An array of successfully and unsuccessfully processed record results, correlated with the request by natural ordering. A record that is successfully added to a stream includes SequenceNumber and ShardId in the result. A record that fails to be added to a stream includes ErrorCode and ErrorMessage in the result.
To retry the failed records, you have to implement your own retry mechanism. I have written a retry mechanism in Python using a recursive function with an incremental wait between retries, as follows.
import boto3
import time

kinesis_client = boto3.client('kinesis')

KINESIS_RETRY_COUNT = 10
KINESIS_RETRY_WAIT_IN_SEC = 0.1
KINESIS_STREAM_NAME = "your-kinesis-stream"

def send_to_stream(kinesis_records, retry_count):
    put_response = kinesis_client.put_records(
        Records=kinesis_records,
        StreamName=KINESIS_STREAM_NAME
    )

    failed_count = put_response['FailedRecordCount']
    if failed_count > 0:
        if retry_count > 0:
            # Collect only the failed records; the response order matches the request order.
            retry_kinesis_records = []
            for idx, record in enumerate(put_response['Records']):
                if 'ErrorCode' in record:
                    retry_kinesis_records.append(kinesis_records[idx])
            # Wait a little longer on each successive retry before resending.
            time.sleep(KINESIS_RETRY_WAIT_IN_SEC * (KINESIS_RETRY_COUNT - retry_count + 1))
            send_to_stream(retry_kinesis_records, retry_count - 1)
        else:
            print(f'Not able to put records after retries. Records = {put_response["Records"]}')
In the above example, you can change KINESIS_RETRY_COUNT and KINESIS_RETRY_WAIT_IN_SEC to suit your needs. You also have to ensure that your Lambda timeout is sufficient for the retries.
Alternatively, does anybody know if KPL will automatically handle this
issue for me?
I am not sure about KPL, but from the documentation it looks like it has its own retry mechanism: https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html
While you should definitely handle the failures and resend them, one way to minimise the number of extra records to resend is to simply send 500 records, and if you have more to send, delay for 500ms before sending the next lot.
Waiting for 500ms every 500 records will limit you to 1000 records/sec, which is the Kinesis PutRecords limit. Staying under this limit will minimise the number of records that have to be sent multiple times.
Only processing 500 records at a time from a larger list could also make the retry logic easier, because any records that fail can simply be appended onto the end of the master list, where they'll be retried when the loop checks to see if there are any more records in the master list left to send to Kinesis.
Just remember to put a check in to abort if the master list isn't getting any smaller on each attempt to send 500 records, which will happen if there is at least one record that is failing every time. Eventually it will be the last one in the list and will keep being sent over and over forever unless this check is in place.
Note that this applies to one shard, if you have more shards then you can adjust these limits accordingly.
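A rough sketch of that pacing-plus-master-list loop, assuming boto3 and a placeholder stream name (the abort check here interprets "not getting smaller" as a batch where every record failed):
import time
import boto3

kinesis = boto3.client('kinesis')
STREAM_NAME = 'your-kinesis-stream'   # placeholder

def drain(master_list):
    # Send in batches of 500, pausing 500 ms between batches to stay near the
    # 1000 records/sec PutRecords limit for a single shard.
    while master_list:
        batch, master_list = master_list[:500], master_list[500:]
        response = kinesis.put_records(Records=batch, StreamName=STREAM_NAME)
        failed = [batch[i] for i, r in enumerate(response['Records']) if 'ErrorCode' in r]
        if len(failed) == len(batch):
            # The list is not getting any smaller; at least one record fails every time.
            raise RuntimeError('No progress on this batch, aborting')
        master_list.extend(failed)   # failed records go to the back of the list
        time.sleep(0.5)              # pace the next batch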

How does AWS Kinesis throttle write throughput?

AWS Kinesis has a fairly low write throughput limit of 1000 writes/sec and 1 MB/sec per shard. How does Kinesis enforce this limit? If I were to try to do 1500 writes in a second, would the extra 500 writes be placed into some sort of queue, or would they simply fail?
It looks like it simply fails and throws an exception.
An unsuccessfully processed record includes ErrorCode and ErrorMessage values. ErrorCode reflects the type of error and can be one of the following values: ProvisionedThroughputExceededException or InternalFailure. ErrorMessage provides more detailed information about the ProvisionedThroughputExceededException exception including the account ID, stream name, and shard ID of the record that was throttled. For more information about partially successful responses, see Adding Multiple Records with PutRecords in the Amazon Kinesis Data Streams Developer Guide.
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
How the rate limiting is done
Rate Limiting
The KPL includes a rate limiting feature, which limits per-shard throughput sent from a single producer. Rate limiting is implemented using a token bucket algorithm with separate buckets for both Kinesis Data Streams records and bytes. Each successful write to a Kinesis data stream adds a token (or multiple tokens) to each bucket, up to a certain threshold. This threshold is configurable but by default is set 50% higher than the actual shard limit, to allow shard saturation from a single producer.
You can lower this limit to reduce spamming due to excessive retries. However, the best practice is for each producer to retry aggressively for maximum throughput and to handle any resulting throttling determined to be excessive by expanding the capacity of the stream and implementing an appropriate partition key strategy.
https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html
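To make the token bucket idea concrete, here is a toy sketch (not the KPL's actual implementation; the rates and capacities are placeholders):
import time

class TokenBucket:
    # Toy token bucket: tokens refill continuously up to a capacity; a send is
    # allowed only if enough tokens remain, otherwise the caller should back off.
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # refill rate (e.g. records/sec or bytes/sec)
        self.capacity = capacity      # burst ceiling (KPL defaults to ~150% of the shard limit)
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, amount):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

# Separate buckets per shard for records and bytes, as the KPL docs describe.
records_bucket = TokenBucket(rate_per_sec=1000, capacity=1500)
bytes_bucket = TokenBucket(rate_per_sec=1_000_000, capacity=1_500_000)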
This depends on the way that you're writing the data.
If you're using PutRecord then any request that exceeds the limit will fail with ProvisionedThroughputExceededException and you'll have to retry the request. However, since round-trip times for a single request are on the order of 20-30 ms, you'll need to have a large number of clients to get throttled.
The PutRecords call has a much higher likelihood of being throttled, because you can send up to 500 records in a single request. And if it's throttled, the throttling may affect the entire request or individual records within the request (this could happen if one shard accepts records but another doesn't).
To deal with this, you need to examine the Records list from the PutRecords response. This array corresponds exactly with the Records list from the request, but contains PutRecordsResultEntry values.
If an entry has a SequenceNumber then you're OK: that record was written to a shard. If, however, it has an ErrorCode then you need to copy the record from the request and re-send it (assuming that the error code is throughput exceeded; you could also try resending if it's internal error, but that may not work).
You will need to loop, calling PutRecords until the response doesn't have any unsent messages.
Beware that, due to the possibility of individual records being throttled and resent, you can't guarantee the order that records will appear on a shard (they are stored in the shard in the order that they were received).
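A minimal iterative sketch of that loop, retrying only the records whose ErrorCode indicates throttling (the stream name and pass budget are placeholders):
import boto3

kinesis = boto3.client('kinesis')

def put_until_done(records, stream_name, max_passes=10):
    # Resend only throttled records until everything is accepted or the pass budget runs out.
    for _ in range(max_passes):
        response = kinesis.put_records(Records=records, StreamName=stream_name)
        if response['FailedRecordCount'] == 0:
            return
        # Keep only throttled entries; InternalFailure records are not retried in this sketch.
        records = [records[i] for i, r in enumerate(response['Records'])
                   if r.get('ErrorCode') == 'ProvisionedThroughputExceededException']
        if not records:
            raise RuntimeError('Remaining failures were not throttling errors')
    raise RuntimeError('Records still being throttled after all passes')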

Checkpointing records with Amazon KCL throws ProvisionedThroughputExceededException

We are experiencing a ProvisionedThroughputExceededException upon checkpointing many events together.
The exception stacktrace is the following:
com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId-000000000000 in stream mystream under account accountid. (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: ea36760b-9db3-0acc-bbe9-87939e3270aa)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1529)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1167)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:948)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:618)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$300(AmazonHttpClient.java:586)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:573)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:445)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:1645)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:1621)
at com.amazonaws.services.kinesis.AmazonKinesisClient.getShardIterator(AmazonKinesisClient.java:909)
at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.getIterator(KinesisProxy.java:291)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.SequenceNumberValidator.validateSequenceNumber(SequenceNumberValidator.java:79)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:90)
As you can see here, the exception is raised at
RecordProcessorCheckpointer.java:90
inside the KCL library. What does checkpointing have to do with exceeding the throughput?
Kinesis is rate-limited.
PutRecord requests can only be processed up to the limit of the provisioned throughput on the involved shard; exceeding this will throw a ProvisionedThroughputExceededException.
The obvious solution would be to split the stream's shard into two and divide the hash key space evenly. It might look unnecessary if your metrics are within the limits of a single shard, but say you use up your 1000 transactions/sec write capacity in the first 500 ms: your activity for that shard will be throttled for the remaining half second, so there is no way to avoid throttling with a single shard.
You can configure automatic retries after short delays for your throttled requests. Check your SDK's documentation for examples of this.
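If you do split the shard, a rough boto3 sketch of splitting at the midpoint of the hash key range could look like this (the stream name and the choice of shard are placeholders):
import boto3

kinesis = boto3.client('kinesis')
STREAM_NAME = 'mystream'   # placeholder

# Look up the open shard's hash key range.
shard = kinesis.describe_stream(StreamName=STREAM_NAME)['StreamDescription']['Shards'][0]
start = int(shard['HashKeyRange']['StartingHashKey'])
end = int(shard['HashKeyRange']['EndingHashKey'])

# Split the hash key space evenly so each child shard gets its own throughput limit.
kinesis.split_shard(
    StreamName=STREAM_NAME,
    ShardToSplit=shard['ShardId'],
    NewStartingHashKey=str((start + end) // 2),
)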