DynamoDB - How exactly does the throughput limit work? - amazon-web-services

Let's say I have:
A table with 100 RCUs
This table has 200 items
Each item is 4 KB
As far as I understand, RCUs are calculated per second and you spend 1 full RCU per 4 KB (with a strongly consistent read).
1) Because of this, if I spend more than 100 RCUs in one second I should get a throttling error, right?
2) How can I predict that a certain request will require more than my provisioned throughput? It feels scary that at any time I can compromise the whole database by making an expensive request.
3) Let's say I want to do a scan on the whole table (get all items), so that should require 200 RCUs. But that will depend on how fast DynamoDB does it, right? If it's too fast it will give me an error, but if it takes 2 seconds or more it should be fine. How do I account for this? How do I take DynamoDB's speed into account to know how many RCUs I will need? What is DynamoDB's "speed"?
4) What's the difference between throttling and throughput limit exceeded?
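To put numbers on point 3, here's roughly how I'm picturing the arithmetic (a rough sketch only, assuming strongly consistent reads and the 1-RCU-per-4-KB rounding described above):

import math

def strongly_consistent_rcus(item_size_kb, items_read):
    # Rough estimate: 1 RCU per 4 KB (rounded up) for each strongly consistent read.
    return math.ceil(item_size_kb / 4) * items_read

# Reading all 200 items of 4 KB each within a single second:
print(strongly_consistent_rcus(4, 200))  # 200 RCUs -- double the 100 provisioned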

Most of your questions are theoretical at this point, because you now (as of Nov 2018) have the option of simply telling DynamoDB to use 'on demand' mode, where you no longer need to calculate or worry about RCUs. Simply enable this option and forget about it. I had similar problems in the past because of very uneven workloads - periods of no activity and then periods where I needed to do full table scans to generate a report - and struggled to get it all working seamlessly.
I turned on 'on demand' mode, cost went down by about 70% in my case, and no more throttling errors. Your cost profile may be different, but I would definitely check out this new option.
https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/
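If you want to try it, switching an existing table to on-demand billing is a single API call. A minimal sketch with boto3 (the table name here is just a placeholder):

import boto3

dynamodb = boto3.client("dynamodb")

# Switch an existing provisioned-capacity table to on-demand (pay-per-request) billing.
dynamodb.update_table(
    TableName="my-table",           # placeholder table name
    BillingMode="PAY_PER_REQUEST",
)

After that, DynamoDB no longer enforces a provisioned RCU/WCU limit on the table; you simply pay per request.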

Related

AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing, which appear to be a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There are 3 indices with searchable data. 2 of them have roughly 1 million documents (500-600 MB) each and one has 25k (~20 MB). Indexing is not very simple (it has history tracking), so I've been testing the refresh parameter with true and wait_for values, or calling refresh separately when needed. The process uses search and bulk queries (I've been trying batch sizes of 500 and 1000). There should be a 10 MB request limit on the AWS side, so these are safely below that. I've also tested adding 0.5/1 second delays between requests, but none of this fiddling really has any noticeable benefit.
The project is currently in development, so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, larger ones once a week. Upscaling the infrastructure is not something we want to do just because indexing is so brittle. Even updating just the 25k-document index twice in a row tends to fail with the above-mentioned error. Any ideas how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to get 429 circuit_breaking_exception-s (instead of the current 403) with a reason along the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. I used the cluster stats API to track memory usage during indexing, but didn't find anything that I could identify as a direct cause of the issue.
Ended up creating a solution based on the information that I could find. After some searching and reading it seemed like just trying again when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
The same guide also has useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case indexing is a single linear process that does not occur frequently, so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Used the refresh API and the refresh parameter (with a true value) when and where needed
When running into a "403 Request throttled due to too many requests" error the program will keep trying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. Will adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.
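For reference, a rough sketch of that retry-plus-refresh flow with the Python Elasticsearch client (7.x); the endpoint, index name, and documents are placeholders, and the 15/60-second numbers are the ones mentioned above:

import time

from elasticsearch import Elasticsearch, helpers
from elasticsearch.exceptions import TransportError

es = Elasticsearch("https://my-domain.es.amazonaws.com")   # placeholder endpoint
INDEX = "my-index"                                         # placeholder index name
documents = [{"_id": i, "field": f"value {i}"} for i in range(25000)]  # placeholder payload

def bulk_with_retry(actions, wait=15, time_limit=60):
    # Send a bulk request, retrying every `wait` seconds on throttling (403/429).
    start = time.time()
    while True:
        try:
            return helpers.bulk(es, actions, index=INDEX)
        except TransportError as e:
            if e.status_code in (403, 429) and time.time() - start < time_limit:
                time.sleep(wait)            # back off, then try again
            else:
                raise

# Disable automatic refreshes while the linear indexing process runs.
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "-1"}})
try:
    bulk_with_retry(documents)
finally:
    # Restore the default refresh behaviour and make the new data searchable.
    es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": None}})
    es.indices.refresh(index=INDEX)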

Why is my DynamoDB scan so fast with only 1 provisioned read capacity unit?

I made a table with 1346 items, each item being less than 4KB in size. I provisioned 1 read capacity unit, so I'd expect on average 1 item read per second. However, a simple scan of all 1346 items returns almost immediately.
What am I missing here?
This is likely down to burst capacity, with which you accumulate unused capacity over a 300-second period that can be spent on burstable actions (such as scanning an entire table).
This means that if you used all of these credits, other interactions would suffer, as they would not have enough capacity available to them.
You can see the amount of consumed WCU/RCU via either CloudWatch metrics or within the DynamoDB interface itself (via the Metrics tab).
You don't give a size for your entries except to say "each item being less than 4KB". How much less?
1 RCU will support 2 eventually consistent reads per second of items up to 4KB.
To put that another way, with 1 RCU and eventually consistent reads, you can read 8KB of data per second.
If your records are 4KB, then you get 2 records/sec
1KB, 8/sec
512B, 16/sec
256B, 32/sec
So the "burst" capability already mentioned allowed you to use 55 RCUs.
But the small size of your records allowed those 55 RCUs to return the data "almost immediately".
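To make that arithmetic explicit, a tiny sketch (the item sizes mirror the list above; 1 RCU buys roughly 8 KB/sec of eventually consistent reads):

# 1 RCU = two eventually consistent reads of up to 4 KB each, i.e. ~8 KB/sec.
def records_per_second(item_size_bytes, rcus=1):
    bytes_per_second = rcus * 2 * 4 * 1024
    return bytes_per_second // item_size_bytes

for size in (4096, 1024, 512, 256):
    print(size, records_per_second(size))   # 2, 8, 16, 32 records/sec with 1 RCU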
There are two things working in your favor here - one is that a Scan operation takes significantly fewer RCUs than you thought it did for small items. The other thing is the "burst capacity". I'll try to explain both:
The DynamoDB pricing page says that "For items up to 4 KB in size, one RCU can perform two eventually consistent read requests per second." This suggests that even if an item is 10 bytes in size, it costs half an RCU to read it with eventual consistency. However, although they don't state this anywhere, this cost is only true for a GetItem operation retrieving a single item. In a Scan or Query, it turns out that you don't pay separately for each individual item. Instead, these operations scan data stored on disk sequentially, and you pay for the amount of data thus read. If you have 1000 tiny items and the total size that DynamoDB had to read from disk was 80KB, you will pay 80KB/4KB/2, or 10 RCUs, not 500 RCUs.
This explains why you read 1346 items, and measured only 55 RCUs, not 1346/2 = 673.
The second thing working in your favor is that DynamoDB has the "burst capacity" capability, described here:
DynamoDB currently retains up to 5 minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed quickly—even faster than the per-second provisioned throughput capacity that you've defined for your table.
So if your database existed for 5 minutes prior to your request, DynamoDB saved 300 RCUs for you, which you can use up very quickly. Since 300 RCUs is much more than you needed for your scan (55), your scan happened very quickly, without throttling.
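If you want to check the consumed RCUs for a scan yourself rather than estimate them, you can ask DynamoDB to return them with the response. A small boto3 sketch (the table name is a placeholder):

import boto3

table = boto3.resource("dynamodb").Table("my-table")   # placeholder table name

consumed = 0.0
kwargs = {"ReturnConsumedCapacity": "TOTAL"}
while True:
    page = table.scan(**kwargs)
    consumed += page["ConsumedCapacity"]["CapacityUnits"]
    if "LastEvaluatedKey" not in page:
        break                                          # no more pages
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(f"Scan consumed {consumed} RCUs")                # e.g. ~55 for the table described above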
When you do a query, the RCU count applies to the quantity of data read without considering the number of items read. So if your items are small, say a few bytes each, they can easily be queried inside a single 4KB RCU.
This is especially useful when reading many items from DynamoDB as well. It's not immediately obvious that querying many small items is far cheaper and more efficient than BatchGetting them.

Number of WCU equal to number of items to write in DynamoDB?

I have been struggling to understand the meaning of WCU in the AWS DynamoDB documentation. What I understood from the documentation is that
If your application needs to write 1000 items where each item is of
size 0.2KB then you need to provision 1000 WCU (i.e. 0.2 KB rounds up
to the nearest 1 KB, so 1000 items (to write) * 1 WCU = 1000 WCU)
If my above understanding is correct, then I am wondering: for those applications that need to write millions of records into DynamoDB per second, do those applications need to provision that many millions of WCUs?
I'd appreciate it if you could clarify.
I've used DynamoDB in the past (and have experience scaling out the RCUs and WCUs for my application), and according to the AWS docs:
One write capacity unit represents one write per second for an item up
to 1 KB in size. If you need to write an item that is larger than 1
KB, DynamoDB will need to consume additional write capacity units. The
total number of write capacity units required depends on the item
size.
So it means that if you write an item which is 4.5 KB in size, it will consume 5 WCUs; DynamoDB rounds up to the next integer.
Also, your understanding
here each item is of size 0.2KB then you need to provision 1000 WCU
(i.e. 0.2 KB rounds up to the nearest 1 KB, so 1000 items (to write) *
1 WCU = 1000 WCU)
is correct.
To save WCUs, you need to design your system in such a way that your item sizes are close to a whole number of KB, so the round-up wastes as little capacity as possible.
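A minimal sketch of that rounding rule (the sizes are just the examples already used in this thread):

import math

def wcus_per_write(item_size_kb):
    # One WCU per write for each full 1 KB of item size, rounded up.
    return max(1, math.ceil(item_size_kb))

print(wcus_per_write(0.2))   # 1 WCU -> writing 1000 such items per second needs 1000 WCUs
print(wcus_per_write(4.5))   # 5 WCUs, as in the 4.5 KB example above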
Note: to avoid the large costs associated with DynamoDB when you have lots of reads, you can use caching on top of DynamoDB, which is also suggested by AWS and was implemented by us as well. (If your application is write-heavy, this approach will not work and you should consider some other alternative like Elasticsearch, etc.)
According to the http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html doc:
A caching solution can mitigate the skewed read activity for popular
items. In addition, since it reduces the amount of read activity
against the table, caching can help reduce your overall costs for
using DynamoDB.

Reduce frequency of sitecore_analytics_index update/optimization

Our content management server hosts the Lucene sitecore_analytics_index.
By default, the sitecore_analytics_index uses a TimedIndexRefreshStrategy with an interval of 1 minute. This means that every minute, Sitecore adds new analytics data to the index, and then optimizes the index.
We've found that the optimization part takes ~20 minutes for our index. In practice, this means that the index is constantly being optimized, resulting in non-stop high disk I/O.
I see two possible ways to improve the situation:
Don't run the optimize step after index updates, and implement an agent to optimize the index just once per day (as per this post). Is there a big downside to only optimizing the index, say, once per day? AFAIK it's not necessary to optimize the index after every update.
Keep running the optimize step after every index update, but increase the interval from 1 minute to something much higher. What ill-effects might we see from this?
Option 2 is easier as it is just a config change, but I suspect that updating the index less frequently might be bad (hopefully I'm wrong?). Looking in the Sitecore search log, I see that the analytics index is almost constantly being searched, but I don't know what by, so I'm not sure what might happen if I reduce the index update frequency.
Does anyone have any suggestions? Thanks.
EDIT: alternatively, what would be the impact of disabling the Sitecore analytics index entirely (and how could I do that)?

DynamoDB change in throughput

I am using DynamoDB and I want to change throughput of Dynamo tables.
Will the throughput change instantly, or will it take some time to take full effect?
I tried searching for the answer but could not find it, even on the Amazon website.
If I change the throughput for a table, how much time will it take to be applied?
It won't be instantaneous. From my experience it depends greatly on the size of your current throughput and data. For a small table with low throughput (fewer than a few hundred reads or writes per second), it should take a few minutes.
For larger tables with higher throughputs I've seen it take a lot longer, as long as 30 minutes. Sorry, this is just based on observation; I don't have any formal metrics on it. You can continue to use the table while it's updating.
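For what it's worth, you can watch the change complete by polling the table status; it reads UPDATING while the new throughput is being applied and returns to ACTIVE when done. A quick boto3 sketch (the table name and capacity values are placeholders):

import time
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="my-table",                                  # placeholder table name
    ProvisionedThroughput={"ReadCapacityUnits": 200,       # placeholder capacities
                           "WriteCapacityUnits": 100},
)

# The table stays usable while it is UPDATING; poll until it returns to ACTIVE.
while dynamodb.describe_table(TableName="my-table")["Table"]["TableStatus"] != "ACTIVE":
    time.sleep(10)
print("Throughput change applied")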
This document explains how DynamoDB responds to increases in throughput: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
While this doesn't explain how long it will take to create new partitions, it explains under what circumstances a new partition would actually need to be made.