Iterate through DynamoDBQueryExpression results without going over capacity limits

I have a DynamoDB table that could contain upwards of 100,000 entries. I have a query expression that I'd like to run against this table which could return up to 5,000 entries. Once I get this response list, I'm going to iterate through it and perform certain operations on each entry. So far this is what I have:
DynamoDBQueryExpression<Item> query = new DynamoDBQueryExpression<Item>();
DynamoDBMapper mapper = new DynamoDBMapper(client); // client: an AmazonDynamoDB instance
PaginatedList<Item> results = mapper.query(Item.class, query);
for (Item item : results) {
    item.doStuff();
}
I have read and write capacities of 20 and I need to make sure that I don't surpass those limits. How can I do that? Is there a way to change the query or PaginatedList so that it doesn't consume results faster than 20 capacity units per second?

Use DynamoDBQueryExpression.withLimit to limit the number of items returned by each API call. This will help you "guide" the provisioned throughput when you iterate the paginated result.
withLimit is tricky, and worth re-reading until it sticks: it limits the items returned per API call, not the total items returned by the Query as you iterate.
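For illustration, here is a minimal sketch of that pattern: page through the results with queryPage, capping each call at 20 items and pacing the calls so that, for small items with eventually consistent reads, no single second consumes more than the provisioned 20 read units. The Item class, table name, and key value below are hypothetical stand-ins for the asker's model.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBQueryExpression;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;
import com.amazonaws.services.dynamodbv2.datamodeling.QueryResultPage;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

import java.util.Map;

public class ThrottledIteration {

    @DynamoDBTable(tableName = "Items") // hypothetical table name
    public static class Item {
        private String id;

        @DynamoDBHashKey(attributeName = "id")
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }

        public void doStuff() { /* per-item work from the question */ }
    }

    public static void main(String[] args) throws InterruptedException {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        DynamoDBMapper mapper = new DynamoDBMapper(client);

        Item hashKey = new Item();
        hashKey.setId("some-partition-key"); // hypothetical key value

        DynamoDBQueryExpression<Item> query = new DynamoDBQueryExpression<Item>()
                .withHashKeyValues(hashKey)
                .withLimit(20); // at most 20 items fetched per API call

        Map<String, AttributeValue> lastKey = null;
        do {
            query.setExclusiveStartKey(lastKey);
            QueryResultPage<Item> page = mapper.queryPage(Item.class, query);
            for (Item item : page.getResults()) {
                item.doStuff();
            }
            lastKey = page.getLastEvaluatedKey();
            Thread.sleep(1000); // crude pacing: at most one 20-item page per second
        } while (lastKey != null);
    }
}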

Related

DynamoDB Scan/Query Return x Number of Items

If I scan or query in DynamoDB it is possible to set the Limit property. The DynamoDB documentation says the following:
The maximum number of items to evaluate (not necessarily the number of matching items).
So the problem with this is if you set filters and such it won't return all the items.
My goal that I'm trying to figure out how to achieve is to have a filter in a scan or query, but have it return x number of items, no matter what. I'm ok with having to use LastEvaluatedKey and make multiple requests, but I would like to make it as seamless and easy as possible (so not doing that would be best).
The only way I have thought to do this is to set the Limit property to say 1 or something. Then just keep scanning or querying using the LastEvaluatedKey until I reach that x number of items I'm looking for. Problem is, this seems VERY wasteful and inefficient. I mean if you have a table of millions of records you might have to make thousands and thousands of requests. It doesn't seem like it scales very well. Of course I'm sure it's no different than what DynamoDB would be doing behind the scenes.
But is there a way to do this more efficiently where I can reduce the number of requests I have to make? Or is that the only way to achieve this?
How would you achieve this goal?
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
You're 100% right that Limit is applied before FilterExpression, meaning Dynamo might return some number of documents less than the Limit while other documents that satisfy the FilterExpression still exist in the table but aren't returned.
It sounds like it would be unacceptable for your API to behave in the same manner. That is going to mean that in some cases, a single request to your service will result in multiple requests to Dynamo. Also, keep in mind that there is no way to predict what the LastEvaluatedKey will be, which would be required to parallelize these requests. So in the case that your service makes multiple requests to Dynamo, they will be serial. To me, this is a rather heavy tradeoff, but if it is a requirement that you satisfy the Limit whenever possible, you have options.
First, Dynamo will automatically page at 1 MB. That means you could simply send your query to Dynamo without a Limit and implement the Limit on your end. You may still need to make multiple requests to ensure that you've satisfied the Limit, but this approach will result in the fewest number of requests to Dynamo. The trade-off here is the total data being read and transferred: chances are your Limit will not happen to line up perfectly with the 1 MB limit, which means the excess data being read, filtered, and transferred is wasted.
You already mentioned the other extreme of sending a Limit of 1 and pointed out that it will result in the maximum number of requests to Dynamo.
Another approach along these lines is to create some sort of probabilistic function that takes the Limit given to your service by the client and computes a new Limit for Dynamo. For example, suppose your FilterExpression filters out about half of the documents in the table; then you can multiply the client Limit by 2 and that would be a reasonable Limit to send to Dynamo. Of the approaches we've talked about so far, this one has the highest potential for efficiency; however, it also has the highest potential for complexity. For example, you might find that using a simple linear function is not good enough and instead you need machine learning to find a multi-variate non-linear function to calculate the new Limit. This approach also depends heavily on the uniformity of your data in Dynamo as well as your access patterns. Again, you might need machine learning to optimize for those variables.
In any of the cases where you are implementing the Limit on your end, if you plan on sending back the LastEvaluatedKey to the client for subsequent calls to your service, you will also need to take care to keep track of the LastEvaluatedKey that you evaluated. You will no longer be able to rely on the LastEvaluatedKey returned from Dynamo.
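As a rough sketch of implementing the Limit on your end with the low-level client (the table name, key condition, and filter expression below are hypothetical placeholders):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ClientSideLimit {

    // Collect up to clientLimit filtered items, paging serially until the
    // limit is satisfied or the partition is exhausted.
    static List<Map<String, AttributeValue>> queryWithClientLimit(
            AmazonDynamoDB dynamo, int clientLimit) {
        List<Map<String, AttributeValue>> collected = new ArrayList<>();
        Map<String, AttributeValue> startKey = null;
        do {
            QueryRequest request = new QueryRequest()
                    .withTableName("MyTable")                         // hypothetical
                    .withKeyConditionExpression("pk = :pk")           // hypothetical
                    .withFilterExpression("attribute_exists(active)") // hypothetical
                    .addExpressionAttributeValuesEntry(":pk", new AttributeValue("user-123"))
                    .withExclusiveStartKey(startKey);
            QueryResult result = dynamo.query(request);
            for (Map<String, AttributeValue> item : result.getItems()) {
                collected.add(item);
                if (collected.size() == clientLimit) {
                    // Caution: the LastEvaluatedKey to hand back to your caller is
                    // the key of *this* item, not result.getLastEvaluatedKey().
                    return collected;
                }
            }
            startKey = result.getLastEvaluatedKey();
        } while (startKey != null);
        return collected;
    }
}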
The final approach would be to reorganize/regroup your data, either with a GSI, a separate table that you keep in sync using DynamoDB Streams, or a different schema altogether, with the goal of not requiring a FilterExpression.

Limiting and Ordering the Scan Result AWS

I am using AWS Mobile Hub and I created a DynamoDB table (userId, username, usertoplevel, usertopscore).
My partition key is a string (userId) and I have created one Global Secondary Index (GSI) in which usertoplevel is the partition key and usertopscore is the sort key. I can successfully query for all items with the following code:
final DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
DynamoDBMapper mapper = AWSMobileClient.defaultMobileClient().getDynamoDBMapper();
List<UserstopcoreDO> results = mapper.scan(UserstopcoreDO.class, scanExpression);
for (UserstopcoreDO usertopScore : results) {
    Logger.d("SizeOfUserScore : " + usertopScore.getUsertopscore());
}
Now I have 1,500+ records in the table and I want to limit the result to fetch only the top 10 users. I will be thankful if someone can help.
In order to achieve this you need to move away from Scan and use the Query operation.
The Query operation gives you the option to specify whether the index should be read forwards or in reverse.
In order to get the top 10 results, you need to limit the results returned to 10. This can be done by setting a limit on your query operation.
Therefore, to summarize (see the sketch below):
Start using a Query operation instead of Scan.
Set ScanIndexForward to false to read results in descending order.
Set a limit on your Query operation to return the top 10 results.
This page describes all the things I mentioned in this answer: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
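The three steps above map directly onto the mapper API. A sketch (the GSI name and the partition-key value are hypothetical, and note that a GSI query must use eventually consistent reads):

import java.util.Collections;
import java.util.List;

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBQueryExpression;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

DynamoDBMapper mapper = AWSMobileClient.defaultMobileClient().getDynamoDBMapper();

DynamoDBQueryExpression<UserstopcoreDO> query = new DynamoDBQueryExpression<UserstopcoreDO>()
        .withIndexName("usertoplevel-usertopscore-index") // hypothetical GSI name
        .withKeyConditionExpression("usertoplevel = :lvl")
        .withExpressionAttributeValues(
                Collections.singletonMap(":lvl", new AttributeValue().withS("expert"))) // hypothetical level
        .withScanIndexForward(false) // descending by sort key (usertopscore)
        .withLimit(10)               // top 10 per page
        .withConsistentRead(false);  // GSIs support only eventually consistent reads

List<UserstopcoreDO> top10 = mapper.queryPage(UserstopcoreDO.class, query).getResults();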
The limit can be set in the scan expression, but read the definition of Limit carefully: it is the maximum number of items to evaluate, not to return. However, you don't need to worry if no filter expression is used in the scan.
If you do use a filter expression, you may need to scan recursively until LastEvaluatedKey is null.
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression().withLimit(10);
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation.
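A sketch of that recursive scan with the mapper from the question (the filter on usertopscore is a hypothetical example):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBScanExpression;
import com.amazonaws.services.dynamodbv2.datamodeling.ScanResultPage;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

// Limit caps the items *evaluated* per call, so with a filter we keep
// scanning until 10 matches are collected or the table is exhausted.
DynamoDBScanExpression scan = new DynamoDBScanExpression()
        .withLimit(10)
        .withFilterExpression("usertopscore > :min") // hypothetical filter
        .withExpressionAttributeValues(
                Collections.singletonMap(":min", new AttributeValue().withN("100")));

List<UserstopcoreDO> top = new ArrayList<>();
ScanResultPage<UserstopcoreDO> page;
do {
    page = mapper.scanPage(UserstopcoreDO.class, scan); // mapper from the question above
    top.addAll(page.getResults());
    scan.setExclusiveStartKey(page.getLastEvaluatedKey());
} while (top.size() < 10 && page.getLastEvaluatedKey() != null);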

What's the difference between BatchGetItem and Query in DynamoDB?

I've been going through the AWS DynamoDB docs and, for the life of me, cannot figure out the core difference between batchGetItem() and Query(). Both retrieve items based on primary keys from tables and indexes. The only difference is in the size of the items retrieved, but that doesn't seem like a groundbreaking difference. Both also support conditional updates.
In what cases should I use batchGetItem over Query and vice-versa?
There’s an important distinction that is missing from the other answers:
Query requires a partition key
BatchGetItems requires a primary key
Query is only useful if the items you want to get happen to share a partition (hash) key, and you must provide this value. Furthermore, you have to provide the exact value; you can’t do any partial matching against the partition key. From there you can specify an additional (and potentially partial/conditional) value for the sort key to reduce the amount of data read, and further reduce the output with a FilterExpression. This is great, but it has the big limitation that you can’t get data that lives outside a single partition.
BatchGetItems is the flip side of this. You can get data across many partitions (and even across multiple tables), but you have to know the full and exact primary key: that is, both the partition (hash) key and any sort (range) key. It's literally like calling GetItem multiple times in a single operation. You don't have the partial-searching and filtering options of Query, but you're not limited to a single partition either.
As per the official documentation:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.html#CapacityUnitCalculations
For BatchGetItem, each item in the batch is read separately, so DynamoDB first rounds up the size of each item to the next 4 KB and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchGetItem reads a 1.5 KB item and a 6.5 KB item, DynamoDB will calculate the size as 12 KB (4 KB + 8 KB), not 8 KB (1.5 KB + 6.5 KB).
For Query, all items returned are treated as a single read operation. As a result, DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
You should use BatchGetItem if you need to retrieve many items with little HTTP overhead when compared to GetItem.
A BatchGetItem costs the same as calling GetItem for each individual item. However, it can be faster since you are making fewer network requests.
In a nutshell:
BatchGetItem works on tables and uses the hash key to identify the items you want to retrieve. You can get up to 16 MB or 100 items in a response.
Query works on tables, local secondary indexes and global secondary indexes. You can get at most 1 MB of data in a response. The biggest difference is that Query supports filter expressions, which means that you can request data and DDB will filter it server-side for you.
You can probably achieve the same thing with either if you really want to, but the rule of thumb is: do a BatchGet when you need to bulk-dump stuff from DDB, and use Query when you need to narrow down what you want to retrieve (and you want Dynamo to do the heavy lifting of filtering the data for you).
DynamoDB stores values under two kinds of keys: a single key, called a partition key, like "jupiter"; or a compound partition-and-range key, like "jupiter"/"planetInfo", "jupiter"/"moon001" and "jupiter"/"moon002".
A BatchGet helps you fetch the values for a large number of keys at the same time. This assumes that you know the full key(s) for each item you want to fetch. So you can do a BatchGet("jupiter", "saturn", "neptune") if you have only partition keys, or BatchGet(["jupiter","planetInfo"], ["saturn","planetInfo"], ["neptune","planetInfo"]) if you're using partition + range keys. Each item is charged independently and the cost is the same as individual gets; it's just that the results are batched and the call saves time (not money).
A Query, on the other hand, works only inside a partition + range key combo and helps you find items and keys that you don't necessarily know. If you wanted to count Jupiter's moons, you'd do a Query(select(COUNT), partitionKey: "jupiter", rangeKeyCondition: "startsWith:'moon'"). Or if you wanted to fetch moons no. 7 to 15, you'd do Query(select(ALL), partitionKey: "jupiter", rangeKeyCondition: "BETWEEN:'moon007'-'moon015'"). Here you're charged based on the size of the data items read by the query, irrespective of how many there are.
Adding an important difference: Query supports consistent reads, while BatchGetItem does not.
BatchGetItem can use consistent reads through TableKeysAndAttributes.
Thanks @colmlg for the information.
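To make the contrast concrete, here is a sketch using the document API and the planet/moon example above (the "Planets" table and its attribute names are hypothetical):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.BatchGetItemOutcome;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.TableKeysAndAttributes;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class BatchGetVsQuery {
    public static void main(String[] args) {
        DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());

        // BatchGetItem: full keys known up front, across partitions, with
        // consistent reads opted into via TableKeysAndAttributes.
        TableKeysAndAttributes keys = new TableKeysAndAttributes("Planets")
                .addHashAndRangePrimaryKeys("name", "item",
                        "jupiter", "planetInfo",
                        "saturn", "planetInfo",
                        "neptune", "planetInfo")
                .withConsistentRead(true);
        BatchGetItemOutcome outcome = dynamoDB.batchGetItem(keys);
        for (Item item : outcome.getTableItems().get("Planets")) {
            System.out.println(item.toJSONPretty());
        }

        // Query: one partition, keys discovered by a condition.
        // ("name" is a DynamoDB reserved word, hence the #n alias.)
        Table table = dynamoDB.getTable("Planets");
        ItemCollection<QueryOutcome> moons = table.query(new QuerySpec()
                .withKeyConditionExpression("#n = :pk and begins_with(#i, :moon)")
                .withNameMap(new NameMap().with("#n", "name").with("#i", "item"))
                .withValueMap(new ValueMap()
                        .withString(":pk", "jupiter")
                        .withString(":moon", "moon")));
        for (Item moon : moons) {
            System.out.println(moon.getString("item"));
        }
    }
}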

DynamoDB provisioned throughput for paginated query

I have a small doubt regarding read capacity unit consumption when I query a DynamoDB table with a Limit set on it.
Say my query expression could return 100 matching items if I iterated with LastEvaluatedKey, but the limit is set to 20 and I don't iterate all pages (I want the top 20 only). How many read capacity units will be consumed? Is it going to be for 100 items or only for the retrieved 20? I have read the documentation but could not find anything clearly covering the paginated case.
Here, throughput is the data sent over the network.
When you specify a limit (20 in your case), only that number of rows is transferred at a time. With no limit, a maximum of 1 MB of data will be sent.
The number of read capacity units consumed by a query depends on the size of your result.
For read operations, 4 KB = 1 unit;
for write operations, 1 KB = 1 unit.
For example, if your query returned 15 KB of data, the read units consumed will be 15/4, rounded up = 4 read units.
The Limit parameter will tell DynamoDB how many items to examine. The Read Capacity Units consumed by that query will depend on the size of the items in your table. You will consume the RCU necessary for DynamoDB to look at the first 20 items.
If you are using a filter, you may not receive all 20 of those items. If you have a filter and you need 20 results, you will need to count the number of results and paginate until you have received 20 results. DynamoDB cannot do that counting for you.
Reference: DynamoDB Documentation for Limit
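One way to check this empirically is to ask DynamoDB to report the consumed capacity of the limited query itself (table and key names below are hypothetical):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;
import com.amazonaws.services.dynamodbv2.model.ReturnConsumedCapacity;

public class MeasureQueryCost {
    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
        QueryRequest request = new QueryRequest()
                .withTableName("MyTable") // hypothetical
                .withKeyConditionExpression("pk = :pk")
                .addExpressionAttributeValuesEntry(":pk", new AttributeValue("user-123"))
                .withLimit(20) // stop after evaluating 20 items
                .withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL);
        QueryResult result = dynamo.query(request);
        // Only the items actually evaluated are billed, no matter how many
        // more items the full result set would contain.
        System.out.println("Consumed RCU: " + result.getConsumedCapacity().getCapacityUnits());
    }
}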

AWS DynamoDB Query Call (with no results) Cost

I'm currently working out whether I will use DynamoDB for part of a project. I want to know: if I execute a query against the database for a specific key and it isn't found (e.g., to see if a UserID is present and get its contents if it is), is the query that returns no results considered a read, and is it chargeable?
I expect I will do a certain number of queries that won't return results (polling for information) and need to factor this in.
Below is from the AWS website: http://aws.amazon.com/dynamodb/pricing/
"Note that the required number of units of Read Capacity is determined by the number of items being read per second, not the number of API calls. For example, if you need to read 500 items per second from your table, and if your items are 1KB or less, then you need 500 units of Read Capacity. It doesn’t matter if you do 500 individual GetItem calls or 50 BatchGetItem calls that each return 10 items."
You can simply check it by calling DynamoDB and looking at ConsumedCapacityUnits in the result.
For example, if you call a simple getItem for an item that exists, you get something like:
Result: {Item: {fans={SS: [Billy Bob, James], }, name={S: Airplane, }, year={N: 1980, }, rating={S: *****, }}, ConsumedCapacityUnits: 0.5, }
However, when you call it on an item that doesn't exist, you get:
Result: {ConsumedCapacityUnits: 0.5, }
Therefore, it appears that you consume capacity even if the item is not in the table, as the lookup runs regardless.
According to this link, there's a difference between Scan and Query operations. A query with no result will not result in any cost.
To provide a more specific answer, the query operation does get billed even if you don't get any results back (the accepted answer covers the getItem operation).
According to the current documentation, all items returned are treated as a single read operation, where DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. Therefore, your 0 KB result would be rounded up to 4 KB and be billed as 0.5 or 1 read capacity unit, depending on whether the read is eventually or strongly consistent.