DynamoDB: When does 1MB limit for queries apply - amazon-web-services

In the docs for DynamoDB it says:
In a Query operation, DynamoDB retrieves the items in sorted order, and then processes the items using KeyConditionExpression and any FilterExpression that might be present.
And:
A single Query operation can retrieve a maximum of 1 MB of data. This limit applies before any FilterExpression is applied to the results.
Does this mean that KeyConditionExpression is applied before this 1MB limit?

Yes, your interpretation is correct. With KeyConditionExpression, DynamoDB can efficiently fetch only the data matching its criteria, so you pay only for the matching data, and the 1MB read limit applies to that matching data. With FilterExpression the story is different: DynamoDB has no efficient way to skip the non-matching items, so it must first fetch everything and then filter out the items you don't want. You therefore pay for reading the entire unfiltered data (before FilterExpression), and the 1MB maximum also applies to the unfiltered data.
If you're still unconvinced that this is the way it should be, here's another issue to consider: imagine that you have 1 gigabyte of data in your database to be scanned (or in a single partition key to be queried), and after filtering, the result will be just 1 kilobyte. If you made this query and expected to get the 1 kilobyte back in one response, DynamoDB would need to read and process the entire 1 gigabyte of data before returning. This could take a very long time, you would have no idea how long, and the request would likely time out while you waited for the result. So instead, DynamoDB makes sure to return to you after every 1MB of data it reads from disk (and for which you pay ;-)). Control will return to you about 1000 (= 1 gigabyte / 1 MB) times during the long query, so you never have a chance to time out. Whether a 1MB limit actually makes sense here or whether it should have been larger, I don't know, and maybe there should have been separate limits for the response size and the amount read. But some sort of limit on the amount read was definitely needed, even if it doesn't translate to large responses.
By the way, the Scan documentation words the explanation of the 1MB limit slightly differently; you may find it clearer than the version in the Query documentation:
A single Scan operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
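In practice this means a caller has to keep requesting pages until DynamoDB stops returning a LastEvaluatedKey, and a page can be empty when the filter rejected everything in that 1MB chunk. A minimal sketch of such a loop follows, assuming `client` is a boto3 DynamoDB client (or any object with a compatible `query` method); the helper names `query_pages` and `query_all_items` are illustrative only.

```python
from typing import Any, Dict, Iterator, List


def query_pages(client: Any, **query_kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Yield one response per Query call until no LastEvaluatedKey is returned."""
    kwargs = dict(query_kwargs)
    while True:
        page = client.query(**kwargs)
        yield page
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return
        # Resume the next call where the previous one stopped reading.
        kwargs["ExclusiveStartKey"] = last_key


def query_all_items(client: Any, **query_kwargs: Any) -> List[Dict[str, Any]]:
    """Collect the items of every page; empty pages are simply skipped over."""
    items: List[Dict[str, Any]] = []
    for page in query_pages(client, **query_kwargs):
        items.extend(page.get("Items", []))
    return items
```

An empty `Items` list together with a `LastEvaluatedKey` is not the end of the data; only a missing `LastEvaluatedKey` is.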

Related

What is the best performance I can get by querying DynamoDB for a maximum 1MB?

I am using DynamoDB for storing data, and I see that 1MB is the hard limit for what a query can return. I have a case that queries a table to fetch 1MB of data from one partition. I'd like to know the best performance I can get.
Based on the DynamoDB docs, one partition can serve a maximum of 3000 RCU. If I send eventually consistent reads, it should support 3000 * 8KB = 24000KB, or about 23MB, per second.
If I send one query request to fetch 1MB from one partition, does this mean it should respond 1/23 second = 43 milliseconds?
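The estimate above can be reproduced with a few lines of arithmetic. This only restates the question's own figures (3000 RCU per partition, 8KB per RCU for eventually consistent reads); it is a throughput ceiling, not a latency prediction.

```python
RCU_PER_PARTITION = 3000     # figure quoted in the question
KB_PER_RCU_EVENTUAL = 8      # two eventually consistent reads of up to 4 KB each

throughput_kb_per_s = RCU_PER_PARTITION * KB_PER_RCU_EVENTUAL  # 24000 KB/s
throughput_mb_per_s = throughput_kb_per_s / 1024               # about 23.4 MB/s
seconds_for_1mb = 1 / throughput_mb_per_s                       # about 0.043 s

print(f"{throughput_mb_per_s:.1f} MB/s, {seconds_for_1mb * 1000:.0f} ms per MB")
```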
I am testing with a Lambda function that sends a query to DynamoDB with X-Ray enabled. The X-Ray trace shows that the query takes 300ms or more, and I don't understand what may cause the long latency.
What can I do if I want to reduce the latency to single-digit milliseconds? I don't want to split the partition since 1MB is not really a big size.
DynamoDB really is capable of single-digit millisecond latency, but only when the item is small enough to fit into 1 RCU. Reading 1 MB of data from a database in <10ms is a challenging task in itself.
Here is what you can try:
1. Split your read operation into two. One query uses ScanIndexForward: true + Limit: N/2 and the other uses ScanIndexForward: false + Limit: N/2. The idea is to read the same data from both ends toward the middle. Run both in parallel and then merge the two responses into one. However, this will likely only reduce latency from 300ms to about 150ms, which is still not <10ms.
2. Use DAX (DynamoDB Accelerator), the DynamoDB caching layer.
3. If your 1 MB of data is spread across thousands of items, consider using fewer items that each hold more data.
4. Consider using a compression algorithm like brotli to compress the data you store in a single DynamoDB item. I once had success with this approach. Depending on the format, it can easily reduce your data size by 4x, which translates into roughly 4x faster query time, or up to 8x when combined with the approach described in item 1.
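The first suggestion (querying from both ends in parallel) could be sketched as follows. This assumes `client` is a boto3 DynamoDB client or a compatible object, that the caller already knows the item count N, that no FilterExpression is used, and that each half fits in a single 1MB page (pagination is not handled). The function name `query_from_both_ends` is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def query_from_both_ends(client: Any, total_items: int, **query_kwargs: Any) -> List[Dict[str, Any]]:
    """Read the first half ascending and the second half descending, in parallel."""
    first_half = (total_items + 1) // 2
    second_half = total_items - first_half

    def run(forward: bool, limit: int) -> List[Dict[str, Any]]:
        if limit == 0:
            return []
        resp = client.query(ScanIndexForward=forward, Limit=limit, **query_kwargs)
        return resp.get("Items", [])

    with ThreadPoolExecutor(max_workers=2) as pool:
        head = pool.submit(run, True, first_half)
        tail = pool.submit(run, False, second_half)
        # The tail arrives in descending sort-key order; reverse it so the
        # merged result is ascending like a single forward query.
        return head.result() + list(reversed(tail.result()))
```

If the item count is not known exactly, the two halves can overlap or leave a gap, so the merge step would need to deduplicate on the primary key or issue a follow-up query.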
Also, beware that constantly reading 1 MB of data from a database will incur huge costs.
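For the compression suggestion, here is a minimal sketch of packing a payload into a single binary blob before storing it in an item. It uses zlib from the Python standard library rather than brotli (a third-party package), so the ratio will differ from the 4x quoted above; `pack` and `unpack` are illustrative names, and the blob would be stored in a Binary attribute.

```python
import json
import zlib


def pack(payload: dict) -> bytes:
    """Serialize to compact JSON and compress; store the result as a Binary attribute."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 9)


def unpack(blob: bytes) -> dict:
    """Reverse of pack(): decompress and parse the JSON payload."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The trade-off is that compressed attributes can no longer be used in filter or key conditions, since DynamoDB only sees opaque bytes.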

Why is my DynamoDB scan so fast with only 1 provisioned read capacity unit?

I made a table with 1346 items, each item being less than 4KB in size. I provisioned 1 read capacity unit, so I'd expect on average 1 item read per second. However, a simple scan of all 1346 items returns almost immediately.
What am I missing here?
This is likely down to burst capacity, in which your unused capacity accumulates over a 300-second period and can then be spent on bursty actions (such as scanning an entire table).
This would mean that if you used up all of these credits, other interactions would suffer because they would not have enough capacity available to them.
You can see the amount of consumed WCU/RCU via either CloudWatch metrics or within the DynamoDB interface itself (via the Metrics tab).
You don't give a size for your entries except to say "each item being less than 4KB". How much less?
1 RCU will support 2 eventually consistent reads per second of items up to 4KB.
To put that another way, with 1 RCU and eventually consistent reads, you can read 8KB of data per second.
If your records are 4KB, then you get 2 records/sec
1KB, 8/sec
512B, 16/sec
256B, 32/sec
So the "burst" capability already mentioned allowed you to use 55 RCU.
But the small size of your records allowed that 55 RCU to return the data "almost immediately"
There are two things working in your favor here - one is that a Scan operation takes significantly fewer RCUs than you thought it did for small items. The other thing is the "burst capacity". I'll try to explain both:
The DynamoDB pricing page says that "For items up to 4 KB in size, one RCU can perform two eventually consistent read requests per second.". This suggests that even if the item is 10 bytes in size, it costs half an RCU to read it with eventual consistency. However, although they don't state this anywhere, this cost only applies to a GetItem operation retrieving a single item. In a Scan or Query, it turns out that you don't pay separately for each individual item. Instead, these operations scan data stored on disk sequentially, and you pay for the amount of data thus read. If you have 1000 tiny items and the total size that DynamoDB had to read from disk was 80KB, you will pay 80KB/4KB/2, or 10 RCUs, not 500 RCUs.
This explains why you read 1346 items, and measured only 55 RCUs, not 1346/2 = 673.
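The difference between the two billing models can be written out as a small calculation. This is a simplified model of the rules described above, not an official formula: real metering is applied per request or page, so actual totals can come out slightly higher. The function names are illustrative.

```python
import math
from typing import Iterable

UNIT_BYTES = 4096  # one read unit covers up to 4 KB


def rcu_for_getitems(item_sizes_bytes: Iterable[int], consistent: bool = False) -> float:
    """Each item is rounded up to 4 KB on its own, as with GetItem."""
    units = sum(max(1, math.ceil(size / UNIT_BYTES)) for size in item_sizes_bytes)
    return units if consistent else units / 2


def rcu_for_scan(total_bytes: int, consistent: bool = False) -> float:
    """The total data read is rounded up to 4 KB once, as with Scan or Query."""
    units = math.ceil(total_bytes / UNIT_BYTES)
    return units if consistent else units / 2


print(rcu_for_getitems([80] * 1000))   # 1000 tiny items fetched one by one
print(rcu_for_scan(80 * 1024))         # the same data read as one 80 KB scan
```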
The second thing working in your favor is that DynamoDB has the "burst capacity" capability, described here:
DynamoDB currently retains up to 5 minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed quickly—even faster than the per-second provisioned throughput capacity that you've defined for your table.
So if your database existed for 5 minutes prior to your request, DynamoDB saved 300 RCUs for you, which you can use up very quickly. Since 300 RCUs is much more than you needed for your scan (55), your scan happened very quickly, without throttling.
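As a rough check of that reasoning, the saved-up capacity can be modeled as provisioned units multiplied by idle time, capped at the 300-second window from the quote above. This is a deliberately simplified model under that assumption; DynamoDB's internal accounting is not documented at this level of detail.

```python
def burst_balance(provisioned_rcu: float, idle_seconds: float, window_seconds: int = 300) -> float:
    """Unused capacity retained, capped at the 300-second window."""
    return provisioned_rcu * min(idle_seconds, window_seconds)


def fits_in_burst(needed_rcu: float, provisioned_rcu: float, idle_seconds: float) -> bool:
    """True when a one-off read can be served from saved-up capacity alone."""
    return needed_rcu <= burst_balance(provisioned_rcu, idle_seconds)


print(fits_in_burst(55, provisioned_rcu=1, idle_seconds=300))  # the scan from the question
```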
When you do a query, the RCU count applies to the quantity of data read without considering the number of items read. So if your items are small, say a few bytes each, they can easily be queried inside a single 4KB RCU.
This is especially useful when reading many items from DynamoDB as well. It's not immediately obvious that querying many small items is far cheaper and more efficient than BatchGetting them.

DynamoDB Scan/Query Return x Number of Items

If I scan or query in DynamoDB it is possible to set the Limit property. The DynamoDB documentation says the following:
The maximum number of items to evaluate (not necessarily the number of matching items).
So the problem with this is that if you set filters and such, it won't necessarily return that many items.
What I'm trying to figure out is how to have a filter in a scan or query, but still have it return x number of items, no matter what. I'm ok with having to use LastEvaluatedKey and make multiple requests, but I would like to make it as seamless and easy as possible (so not having to do that would be best).
The only way I have thought to do this is to set the Limit property to say 1 or something. Then just keep scanning or querying using the LastEvaluatedKey until I reach that x number of items I'm looking for. Problem is, this seems VERY wasteful and inefficient. I mean if you have a table of millions of records you might have to make thousands and thousands of requests. It doesn't seem like it scales very well. Of course I'm sure it's no different than what DynamoDB would be doing behind the scenes.
But is there a way to do this more efficiently where I can reduce the number of requests I have to make? Or is that the only way to achieve this?
How would you achieve this goal?
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
You're 100% right that Limit is applied before FilterExpression. This means Dynamo might return fewer documents than the Limit while other documents that satisfy the FilterExpression still exist in the table but aren't returned.
It sounds like it would be unacceptable for your API to behave in the same manner. That is going to mean that in some cases, a single request to your service will result in multiple requests to Dynamo. Also, keep in mind that there is no way to predict what the LastEvaluatedKey will be, which would be required to parallelize these requests. So in the case that your service makes multiple requests to Dynamo, they will be serial. To me, this is a rather heavy tradeoff, but if it is a requirement that you satisfy the Limit whenever possible, you have options.
First, Dynamo will automatically page at 1 MB. That means you could simply send your query to Dynamo without a Limit and implement the Limit on your end. You may still need to make multiple requests to ensure that you've satisfied the Limit, but this approach will result in the fewest requests to Dynamo. The trade-off here is the total data being read and transferred. Chances are your Limit will not happen to line up perfectly with the 1 MB limit, which means the excess data being read, filtered, and transferred is wasted.
You already mentioned the other extreme of sending a Limit of 1 and pointed out that it will result in the maximum number of requests to Dynamo.
Another approach along these lines is to create some sort of probabilistic function that takes the Limit given to your service by the client and computes a new Limit for Dynamo. For example, suppose your FilterExpression filters out about half of the documents in the table. Then you can multiply the client Limit by 2, and that would be a reasonable Limit to send to Dynamo. Of the approaches we've talked about so far, this one has the highest potential for efficiency; however, it also has the highest potential for complexity. For example, you might find that a simple linear function is not good enough and that you instead need machine learning to find a multivariate non-linear function to calculate the new Limit. This approach also depends heavily on the uniformity of your data in Dynamo as well as your access patterns. Again, you might need machine learning to optimize for those variables.
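The simplest version of such a function is just a division by the observed pass rate of the filter. The sketch below assumes the pass rate is estimated from the Count and ScannedCount fields that Query and Scan responses include; the function names are hypothetical and the linear model is only the starting point described above.

```python
import math


def observed_pass_rate(count: int, scanned_count: int) -> float:
    """Fraction of evaluated items that survived the filter (Count / ScannedCount)."""
    if scanned_count == 0:
        return 1.0  # nothing observed yet; assume no filtering
    return count / scanned_count


def dynamo_limit(client_limit: int, pass_rate: float) -> int:
    """Scale the client's Limit so that roughly client_limit items pass the filter."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return max(1, math.ceil(client_limit / pass_rate))


# A filter that keeps about half of the items: ask Dynamo for twice as many.
print(dynamo_limit(20, observed_pass_rate(count=50, scanned_count=100)))
```

Because the estimate can undershoot, the caller still needs the paging loop as a fallback when too few items come back.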
In any of the cases where you are implementing the Limit on your end, if you plan on sending back the LastEvaluatedKey to the client for subsequent calls to your service, you will also need to take care to keep track of the LastEvaluatedKey that you evaluated. You will no longer be able to rely on the LastEvaluatedKey returned from Dynamo.
The final approach would be to reorganize/regroup your data either with a GSI, a separate table that you keep in sync using Dynamo Streams or a different schema altogether with the goal of not requiring a FilterExpression.
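For the approaches that enforce the Limit on the service side, a sketch of the paging loop, including the bookkeeping for your own resume key, might look like this. It assumes `client` is a boto3 DynamoDB client or a compatible object, that `key_names` lists the primary key attributes of the table being queried (an index query would also need the index key attributes), and that those attributes are present in the returned items. The name `query_n_matching` is made up for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def query_n_matching(
    client: Any, n: int, key_names: Sequence[str], **query_kwargs: Any
) -> Tuple[List[Dict[str, Any]], Optional[Dict[str, Any]]]:
    """Page through Query until n filtered items are collected or the data ends.

    Returns (items, resume_key). resume_key is built from the last item handed
    out, not from Dynamo's LastEvaluatedKey, so the rest of the page is not skipped.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    items: List[Dict[str, Any]] = []
    kwargs = dict(query_kwargs)
    while True:
        page = client.query(**kwargs)
        for item in page.get("Items", []):
            items.append(item)
            if len(items) == n:
                return items, {name: item[name] for name in key_names}
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items, None  # data exhausted before n items were found
        kwargs["ExclusiveStartKey"] = last_key
```

The returned resume key can be passed back in as `ExclusiveStartKey` on the next call. When the n-th item happens to be the last one in the partition, the follow-up call simply returns an empty list.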

What's the difference between BatchGetItem and Query in DynamoDB?

I've been going through AWS DynamoDB docs and, for the life of me, cannot figure out what's the core difference between batchGetItem() and Query(). Both retrieve items based on primary keys from tables and indexes. The only difference is in the size of the items retrieved but that doesn't seem like a ground breaking difference. Both also support conditional updates.
In what cases should I use batchGetItem over Query and vice-versa?
There’s an important distinction that is missing from the other answers:
Query requires a partition key
BatchGetItem requires a primary key
Query is only useful if the items you want to get happen to share a partition (hash) key, and you must provide this value. Furthermore, you have to provide the exact value; you can’t do any partial matching against the partition key. From there you can specify an additional (and potentially partial/conditional) value for the sort key to reduce the amount of data read, and further reduce the output with a FilterExpression. This is great, but it has the big limitation that you can’t get data that lives outside a single partition.
BatchGetItem is the flip side of this. You can get data across many partitions (and even across multiple tables), but you have to know the full and exact primary key: that is, both the partition (hash) key and any sort (range) key. It’s literally like calling GetItem multiple times in a single operation. You don’t have the partial-searching and filtering options of Query, but you’re not limited to a single partition either.
As per the official documentation:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.html#CapacityUnitCalculations
For BatchGetItem, each item in the batch is read separately, so DynamoDB first rounds up the size of each item to the next 4 KB and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchGetItem reads a 1.5 KB item and a 6.5 KB item, DynamoDB will calculate the size as 12 KB (4 KB + 8 KB), not 8 KB (1.5 KB + 6.5 KB).
For Query, all items returned are treated as a single read operation. As a result, DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
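The two rounding rules quoted above can be checked with a short calculation. The function names are illustrative; sizes are in KB, and the result is the size DynamoDB bills for before converting to read units.

```python
import math
from typing import Iterable


def batch_get_billed_kb(item_sizes_kb: Iterable[float]) -> int:
    """BatchGetItem: round each item up to the next 4 KB, then add them up."""
    return sum(math.ceil(size / 4) * 4 for size in item_sizes_kb)


def query_billed_kb(item_sizes_kb: Iterable[float]) -> int:
    """Query: add up all item sizes, then round up once to the next 4 KB."""
    return math.ceil(sum(item_sizes_kb) / 4) * 4


print(batch_get_billed_kb([1.5, 6.5]))  # the documentation's BatchGetItem example
print(query_billed_kb([40.8]))          # the documentation's Query example
```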
You should use BatchGetItem if you need to retrieve many items with little HTTP overhead when compared to GetItem.
A BatchGetItem costs the same as calling GetItem for each individual item. However, it can be faster since you are making fewer network requests.
In a nutshell:
BatchGetItem works on tables and uses the hash key to identify the items you want to retrieve. You can get up to 16MB or 100 items in a response.
Query works on tables, local secondary indexes and global secondary indexes. You can get at most 1MB of data in a response. The biggest difference is that Query supports filter expressions, which means that you can request data and DDB will filter it server side for you.
You can probably achieve the same thing with either of these if you really want to, but the rule of thumb is that you do a BatchGet when you need to bulk dump stuff from DDB, and you query when you need to narrow down what you want to retrieve (and you want Dynamo to do the heavy lifting of filtering the data for you).
DynamoDB stores values in two kinds of keys: a single key, called a partition key, like "jupiter"; or a compound partition and range key, like "jupiter"/"planetInfo", "jupiter"/"moon001" and "jupiter"/"moon002".
A BatchGet helps you fetch the values for a large number of keys at the same time. This assumes that you know the full key(s) for each item you want to fetch. So you can do a BatchGet("jupiter", "saturn", "neptune") if you have only partition keys, or BatchGet(["jupiter","planetInfo"], ["saturn","planetInfo"], ["neptune", "planetInfo"]) if you're using partition + range keys. Each item is charged independently and the cost is the same as individual gets; it's just that the results are batched and the call saves time (not money).
A Query, on the other hand, works only inside a partition + range key combo and helps you find items and keys that you don't necessarily know. If you wanted to count Jupiter's moons, you'd do a Query(select(COUNT), partitionKey: "jupiter", rangeKeyCondition: "startsWith:'moon'"). Or if you wanted to fetch moons no. 7 to 15, you'd do Query(select(ALL), partitionKey: "jupiter", rangeKeyCondition: "BETWEEN:'moon007'-'moon015'"). Here you're charged based on the size of the data items read by the query, irrespective of how many there are.
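The pseudocode calls above map onto real request parameters roughly as follows. This is a sketch that only builds the request dictionaries in the low-level attribute-value format; the table name and the key attribute names `pk` and `sk` are assumptions, since the answer does not name them. With boto3 the results would be passed as `client.batch_get_item(**request)` and `client.query(**request)`.

```python
from typing import Any, Dict, List, Tuple


def batch_get_request(table: str, keys: List[Tuple[str, str]], consistent: bool = False) -> Dict[str, Any]:
    """Full primary keys are required; at most 100 items per BatchGetItem call."""
    if len(keys) > 100:
        raise ValueError("BatchGetItem accepts at most 100 keys per call")
    return {
        "RequestItems": {
            table: {
                "Keys": [{"pk": {"S": pk}, "sk": {"S": sk}} for pk, sk in keys],
                "ConsistentRead": consistent,
            }
        }
    }


def count_moons_request(table: str, planet: str) -> Dict[str, Any]:
    """Exact partition key plus a prefix condition on the sort key; returns only a count."""
    return {
        "TableName": table,
        "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
        "ExpressionAttributeValues": {":pk": {"S": planet}, ":prefix": {"S": "moon"}},
        "Select": "COUNT",
    }


def moon_range_request(table: str, planet: str, first: str, last: str) -> Dict[str, Any]:
    """Exact partition key plus an inclusive sort-key range."""
    return {
        "TableName": table,
        "KeyConditionExpression": "pk = :pk AND sk BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":pk": {"S": planet},
            ":lo": {"S": first},
            ":hi": {"S": last},
        },
    }
```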
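A note on consistency: Query supports consistent reads, and it was originally claimed here that BatchGetItem does not. That claim was wrong: BatchGetItem can also use consistent reads, by setting ConsistentRead in the per-table TableKeysAndAttributes entry of the request.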
Thanks #colmlg for the information.

DynamoDB provisioned throughput for paginated query

I have a small doubt regarding the read capacity unit consumption when I query a DynamoDB table with a Limit set on it.
Say my query expression could return 100 matching items if I iterate it with LastEvaluatedKey, but the limit is set to 20 and I don't iterate all pages (I want the top 20 only). How many read capacity units will be consumed? Is it going to be for 100 items or only for the retrieved 20 items? I have read the documentation but could not find anything clearly mentioning the paginated cases.
Here, throughput is measured on the data DynamoDB reads, which is not necessarily what is finally sent over the network.
When you specify a limit (20 in your case), only that number of items is read at that time. With no limit, a maximum of 1 MB of data is read per request.
The number of read capacity units consumed by a query depends on the total size of the items it reads.
For read operations, 4KB = 1 unit (strongly consistent; an eventually consistent read costs half that),
and for write operations, 1KB = 1 unit.
For example, if your query read 15KB of data, the read units consumed would be 15/4 rounded up = 4 (or 2 for an eventually consistent read).
The Limit parameter will tell DynamoDB how many items to examine. The Read Capacity Units consumed by that query will depend on the size of the items in your table. You will consume the RCU necessary for DynamoDB to look at the first 20 items.
If you are using a filter, you may not receive all 20 of those items. If you have a filter and you need 20 results, you will need to count the number of results and paginate until you have received 20 results. DynamoDB cannot do that counting for you.
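To make the answer to the original question concrete, here is a small model of the cost with and without a Limit. It assumes items are evaluated in order, ignores the 1MB page cap, and uses an invented helper name `query_rcu`; the item sizes are made-up example values.

```python
import math
from typing import Optional, Sequence


def query_rcu(item_sizes_bytes: Sequence[int], limit: Optional[int] = None, consistent: bool = False) -> float:
    """Read units for one Query call: only the items evaluated are billed."""
    evaluated = item_sizes_bytes if limit is None else item_sizes_bytes[:limit]
    units = math.ceil(sum(evaluated) / 4096)
    return units if consistent else units / 2


sizes = [1024] * 100                      # 100 matching items of 1 KB each
print(query_rcu(sizes, limit=20))         # billed for the 20 items examined
print(query_rcu(sizes))                   # billed for all 100 when fully read
```

To see what a real request consumed, the ReturnConsumedCapacity request parameter can be set to TOTAL so the response includes a ConsumedCapacity section.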
Reference: DynamoDB Documentation for Limit