Limiting and Ordering the Scan Result in AWS DynamoDB

I am using AWS Mobile Hub, and I created a DynamoDB table (userId, username, usertoplevel, usertopscore).
My partition key is a string (userId), and I have created one Global Secondary Index (GSI) in which usertoplevel is the partition key and usertopscore is the sort key. I can successfully query for all items with the following code:
final DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
DynamoDBMapper mapper = AWSMobileClient.defaultMobileClient().getDynamoDBMapper();
List<UserstopcoreDO> results = mapper.scan(UserstopcoreDO.class, scanExpression);
for (UserstopcoreDO usertopScore : results) {
    Logger.d("SizeOfUserScore : " + usertopScore.getUsertopscore());
}
Now I have 1,500+ records in the table and I want to limit the result to fetch only the top 10 users. I would be thankful if someone could help.

To achieve this you need to move away from Scan and use the Query operation.
The Query operation lets you specify whether the index should be read forward or in reverse.
To get the top 10 results, you need to limit the results returned to 10, which can be done by setting a limit on your Query operation.
Therefore, to summarize:
Start using a Query operation instead of Scan.
Set scanIndexForward to false to read results in descending order.
Set a limit on your Query operation to return the top 10 results.
This page describes all the things I mentioned in this answer: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
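Here is a minimal sketch of what that looks like with DynamoDBMapper. The index name, the usertoplevel value, and the setter on UserstopcoreDO are assumptions for illustration; adapt them to your generated model class:

// Query the GSI in descending usertopscore order and take the first page of 10.
// Assumes UserstopcoreDO maps usertoplevel/usertopscore to the index via
// @DynamoDBIndexHashKey / @DynamoDBIndexRangeKey, and that the GSI is named
// "usertoplevel-usertopscore-index" (hypothetical).
DynamoDBMapper mapper = AWSMobileClient.defaultMobileClient().getDynamoDBMapper();

UserstopcoreDO hashKey = new UserstopcoreDO();
hashKey.setUsertoplevel("1"); // the partition value you want the top scores for

DynamoDBQueryExpression<UserstopcoreDO> queryExpression =
        new DynamoDBQueryExpression<UserstopcoreDO>()
                .withIndexName("usertoplevel-usertopscore-index")
                .withHashKeyValues(hashKey)
                .withConsistentRead(false)   // GSIs only support eventually consistent reads
                .withScanIndexForward(false) // descending sort-key order
                .withLimit(10);              // evaluate at most 10 items per call

List<UserstopcoreDO> top10 =
        mapper.queryPage(UserstopcoreDO.class, queryExpression).getResults();
for (UserstopcoreDO user : top10) {
    Logger.d("TopScore : " + user.getUsertopscore());
}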

The limit can be set in the scan expression, but read the definition of Limit carefully: it is the maximum number of items to evaluate, not the maximum number of matching items. You don't need to worry about this distinction if no filter expression is used in the scan.
If you do use a filter expression, you may need to scan repeatedly until LastEvaluatedKey is null, as sketched after the quoted definition below.
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression().withLimit(10);
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation.
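As a sketch of that repeated scan with DynamoDBMapper (reusing the mapper and entity class from the question):

// Page through the whole result set, 10 evaluated items at a time,
// until LastEvaluatedKey comes back null.
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression().withLimit(10);
List<UserstopcoreDO> allResults = new ArrayList<>();
ScanResultPage<UserstopcoreDO> page;
do {
    page = mapper.scanPage(UserstopcoreDO.class, scanExpression);
    allResults.addAll(page.getResults());
    // Resume the next call from where this one stopped.
    scanExpression.setExclusiveStartKey(page.getLastEvaluatedKey());
} while (page.getLastEvaluatedKey() != null);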

Related

DynamoDB: When does 1MB limit for queries apply

In the docs for DynamoDB it says:
In a Query operation, DynamoDB retrieves the items in sorted order, and then processes the items using KeyConditionExpression and any FilterExpression that might be present.
And:
A single Query operation can retrieve a maximum of 1 MB of data. This limit applies before any FilterExpression is applied to the results.
Does this mean, that KeyConditionExpression is applied before this 1MB limit?
Indeed, your interpretation is correct. With KeyConditionExpression, DynamoDB can efficiently fetch only the data matching its criteria; you only pay for this matching data, and the 1 MB read limit applies to the matching data. But with FilterExpression the story is different: DynamoDB has no efficient way of filtering out the non-matching items before actually fetching all of them and then discarding the items you don't want. So you pay for reading the entire unfiltered data (before FilterExpression), and the 1 MB maximum also applies to the unfiltered data.
If you're still unconvinced that this is the way it should be, here's another issue to consider: imagine that you have 1 gigabyte of data in your database to be Scan'ed (or in a single key to be Query'ed), and after filtering, the result will be just 1 kilobyte. Were you to make this query and expect to get the 1 kilobyte back, DynamoDB would need to read and process the entire 1 gigabyte of data before returning. This could take a very long time, you would have no idea how long, and you would likely time out while waiting for the result. So instead, DynamoDB makes sure to return to you after every 1 MB of data it reads from disk (and for which you pay ;-)). Control will return to you 1000 (= 1 gigabyte / 1 MB) times during the long query, and you won't have a chance to time out. Whether a 1 MB limit actually makes sense here or it should have been more, I don't know; maybe we should have had a different limit for the response size and the read amount. But definitely some sort of limit was needed on the read amount, even if it doesn't translate to large responses.
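To make the mechanics concrete, here is a rough sketch with the low-level Java SDK v1 client; the table, key, and attribute names are made up for illustration, and client is an AmazonDynamoDB instance. Count tells you what survived the filter, while ScannedCount reflects what was actually read (and billed):

Map<String, AttributeValue> lastKey = null;
int matched = 0, scanned = 0;
do {
    QueryRequest request = new QueryRequest()
            .withTableName("Events")                      // hypothetical table
            .withKeyConditionExpression("userId = :uid")  // narrows the read itself
            .withFilterExpression("eventType = :type")    // applied after items are read
            .addExpressionAttributeValuesEntry(":uid", new AttributeValue("user-123"))
            .addExpressionAttributeValuesEntry(":type", new AttributeValue("login"))
            .withExclusiveStartKey(lastKey);
    QueryResult result = client.query(request);
    matched += result.getCount();           // items remaining after the filter
    scanned += result.getScannedCount();    // items read from disk, which is what you pay for
    lastKey = result.getLastEvaluatedKey(); // null once the key range is exhausted
} while (lastKey != null);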
By the way, the Scan documentation includes a slightly differently worded version of the explanation of the 1 MB limit, which you may find clearer than the version in the Query documentation:
A single Scan operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.

How to conditionally execute a SET operation in DynamoDB

I have an aggregations table in DynamoDB with the following columns: id, sum, count, max, min, and hash. I will ALWAYS want to update sum and count, but I want to update min and max only when I have values greater/less than the values already in the database. Also, I only want this operation to succeed when the stored hash is different from what I am sending, to prevent reprocessing the same data.
I currently have these:
UpdateExpression: ADD sum :sum, count :count SET hash = :hash
ConditionExpression: attribute_not_exists(hash) OR hash <> :hash
The thing is that I need something like this for min and max:
SET min = :min IF :min < min, and something similar for max. Of course, this doesn't currently work; I could not find a suitable update function that would perform this comparison in DynamoDB. What is the proper way to achieve this?
PS: It has already been suggested that I make multiple requests to DynamoDB and place the max/min in the update conditions, but I want to avoid the multiple-request approach for data-consistency reasons.
PS2: Another way to express what I want, in a JavaScript-ish way, would be something like SET min = :min < min ? :min : min.
I got to a solution to this problem by realizing that what I wanted was just not possible. There can be only one condition for the entire update, and since there is no such thing as SET min = minimum(:min, min), I had to accept my fate and make more than one UpdateItem request to DynamoDB.
The nice thing is that the order of execution of these updates doesn't matter. The hard thing is to make sure that each update is executed exactly once. Because we are firing a lot of requests (with occasional peaks), there is a real chance of some updates failing due to a ProvisionedThroughputExceededException or plain rate limiting from AWS.
So here is my final solution:
The Lambda function receives a payload with hundreds of data points.
The Lambda function aggregates these data points in memory and produces an intermediary aggregation object of the form {id, sum, count, min, max}.
The Lambda function generates 3 update objects per aggregation object, all referring to the same record (a Java sketch of the conditional max update follows this list):
{UpdateExpression: 'ADD #SUM :sum, #COUNT :count'}
{ConditionExpression: '#MAX < :max OR attribute_not_exists(#MAX)', UpdateExpression: 'SET #MAX = :max'}
{ConditionExpression: '#MIN > :min OR attribute_not_exists(#MIN)', UpdateExpression: 'SET #MIN = :min'}
Because we need to be 100% sure that these updates will always be processed successfully, the Lambda function sends them to a FIFO SQS queue (as 3 separate messages). I am using a FIFO queue here not because I want the order to be preserved, but because I want the guarantee of exactly-once delivery.
A consumer keeps polling the queue and, whenever there are messages, sends them to DynamoDB as the parameter of .updateItem.
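For reference, here is roughly what the conditional max update looks like as a full request in Java; the table name, key, and value are placeholders, client is an AmazonDynamoDB instance, and the min update is symmetric with < flipped to >:

// #MAX maps to the attribute "max", which is a DynamoDB reserved word.
Map<String, String> names = Collections.singletonMap("#MAX", "max");
Map<String, AttributeValue> values =
        Collections.singletonMap(":max", new AttributeValue().withN("42"));

UpdateItemRequest request = new UpdateItemRequest()
        .withTableName("Aggregations")                  // hypothetical table
        .addKeyEntry("id", new AttributeValue("agg-1")) // hypothetical key
        .withUpdateExpression("SET #MAX = :max")
        .withConditionExpression("#MAX < :max OR attribute_not_exists(#MAX)")
        .withExpressionAttributeNames(names)
        .withExpressionAttributeValues(values);
try {
    client.updateItem(request);
} catch (ConditionalCheckFailedException e) {
    // The stored max is already >= :max, so there is nothing to do.
}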
At the end of this process, I was able to do real-time aggregations for thousands of records :)
PS.: Got rid of the hash column
It is not possible to do this in a single update, since UpdateExpression doesn't support functions like max() and min(). The documentation for supported operations and functions can be found here.
The best way to achieve the same effect is to add a field called latest or something similar which stores the latest value. You will need to change your update expression to be something like the following.
UpdateExpression: SET hash = :hash, latest = :latest, sum = sum + :latest, count = count + :num
Where :hash is of course your update hash to guard against replays, :latest is the latest value, and :num is 1 or whatever your increment is.
Then you can use DynamoDB Streams with a Lambda that looks at each update and checks if latest is less than min or greater than max. If not, ignore the update, otherwise perform a second update to set min or max to the latest value accordingly.
The main drawback of this approach is that there will be a small window where latest might be outside the range of min or max; however, this can easily be normalized in your application code when you read the records.
You should also consider the additional cost that will result from the DynamoDB Stream and the Lambda invocations, sketched below.
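A rough sketch of that stream consumer, assuming the Java Lambda runtime with aws-lambda-java-events 2.x (where stream records carry SDK v1 AttributeValue objects); the attribute names mirror the answer and the follow-up update calls are left as comments:

public class MinMaxNormalizer implements RequestHandler<DynamodbEvent, Void> {

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
            Map<String, AttributeValue> image = record.getDynamodb().getNewImage();
            if (image == null || !image.containsKey("latest")) {
                continue; // deletes, or records without the tracked attribute
            }
            long latest = Long.parseLong(image.get("latest").getN());
            // Fall back to latest when min/max have never been set.
            long min = image.containsKey("min")
                    ? Long.parseLong(image.get("min").getN()) : latest;
            long max = image.containsKey("max")
                    ? Long.parseLong(image.get("max").getN()) : latest;
            if (latest < min) {
                // issue UpdateItem: SET min = :latest, guarded by min > :latest
            }
            if (latest > max) {
                // issue UpdateItem: SET max = :latest, guarded by max < :latest
            }
        }
        return null;
    }
}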
I had a similar situation where I needed to atomically update a min value, and ended up doing this:
Let each item have an attribute of type number set (NS) keeping the candidate values for the min value, and when you want to set a new value that might be the new min, just add it to the set. Then, at read time, find the lowest number in the set on the client side.
This is atomic and requires no condition expression, but it has the downside that the set grows over time, so I added a clean-up request to run as needed, for example when the set has more than N values, or simply on every get. The clean-up might need a condition expression to be concurrency-safe, though, depending on whether you also remove values through other use cases.
This does not solve all scenarios, but worked for me. In my case the value was a timestamp of an event in the future, and I wanted to store when the next event occurs. I could then easily also clean up by removing all values in the past.
Summary:
Set a new potential minimum value: ADD #values :value.
Read the minimum value: GetItem, followed by finding the lowest value in values client-side. If needed, this can be combined with a clean-up that finds all obsolete values and then calls UpdateItem with DELETE #values [x, y, z...].
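A minimal sketch of both operations with the low-level client; the table, key, and attribute names are assumptions, and client is an AmazonDynamoDB instance:

Map<String, AttributeValue> key =
        Collections.singletonMap("id", new AttributeValue("event-1")); // hypothetical key

// Write path: atomically add a candidate minimum; no condition expression needed.
client.updateItem(new UpdateItemRequest()
        .withTableName("Events") // hypothetical table
        .withKey(key)
        .withUpdateExpression("ADD #values :value")
        .withExpressionAttributeNames(Collections.singletonMap("#values", "values"))
        .withExpressionAttributeValues(Collections.singletonMap(
                ":value", new AttributeValue().withNS("1700000000"))));

// Read path: fetch the set and resolve the minimum client-side.
Map<String, AttributeValue> item = client.getItem("Events", key).getItem();
long min = item.get("values").getNS().stream()
        .mapToLong(Long::parseLong)
        .min()
        .getAsLong();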

Iterate through DynamoDBQueryExpression results without going over capacity limits

I have a DynamoDB table that could contain upwards of 100,000 entries. I have a query expression that I'd like to run against this table which could return up to 5,000 entries. After I get this response list, I'm going to iterate through it and perform certain operations on each entry. So far this is what I have:
DynamoDBQueryExpression<Item> query = new DynamoDBQueryExpression<Item>();
DynamoDBMapper mapper = new DynamoDBMapper(AmazonDynamoDBClientBuilder.defaultClient());
PaginatedQueryList<Item> results = mapper.query(Item.class, query);
for (Item item : results) {
    item.doStuff();
}
I have read and write capacities of 20 and I need to make sure that I don't surpass those limits. How can I do that? Is there a way to change the query or the paginated list so that it doesn't consume results faster than 20 capacity units per second?
Use DynamoDBQueryExpression.withLimit to limit the number of items returned by each API call. This will help you "pace" the consumed provisioned throughput as you iterate the paginated result.
withLimit is tricky, and you should keep in mind (and read again and again) that it limits the items returned per API call, not the total number of items the Query returns as you iterate through all pages.
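One way to keep the iteration itself from bursting is to drive the pages manually with queryPage and pace the calls. This is a crude sketch; hashKey is a hypothetical key object, mapper is the one from the question, and the numbers assume items of roughly 1 KB, so tune them to your data:

// Cap each underlying Query call at 20 items and issue at most one call per
// second, so a page of ~1 KB items costs at most ~20 RCUs per second.
DynamoDBQueryExpression<Item> query = new DynamoDBQueryExpression<Item>()
        .withHashKeyValues(hashKey)
        .withLimit(20);

QueryResultPage<Item> page;
do {
    page = mapper.queryPage(Item.class, query);
    for (Item item : page.getResults()) {
        item.doStuff();
    }
    query.setExclusiveStartKey(page.getLastEvaluatedKey());
    try {
        Thread.sleep(1000); // crude pacing between pages
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
} while (page.getLastEvaluatedKey() != null);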

What's the difference between BatchGetItem and Query in DynamoDB?

I've been going through the AWS DynamoDB docs and, for the life of me, cannot figure out the core difference between batchGetItem() and Query(). Both retrieve items based on primary keys from tables and indexes. The only difference is in the size of the items retrieved, but that doesn't seem like a groundbreaking difference. Both also support conditional updates.
In what cases should I use batchGetItem over Query and vice-versa?
There’s an important distinction that is missing from the other answers:
Query requires a partition key.
BatchGetItem requires the full primary key.
Query is only useful if the items you want to get happen to share a partition (hash) key, and you must provide this value. Furthermore, you have to provide the exact value; you can’t do any partial matching against the partition key. From there you can specify an additional (and potentially partial/conditional) value for the sort key to reduce the amount of data read, and further reduce the output with a FilterExpression. This is great, but it has the big limitation that you can’t get data that lives outside a single partition.
BatchGetItem is the flip side of this. You can get data across many partitions (and even across multiple tables), but you have to know the full and exact primary key: that is, both the partition (hash) key and the sort (range) key, if there is one. It's literally like calling GetItem multiple times in a single operation. You don't have the partial-searching and filtering options of Query, but you're not limited to a single partition either.
As per the official documentation:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.html#CapacityUnitCalculations
For BatchGetItem, each item in the batch is read separately, so DynamoDB first rounds up the size of each item to the next 4 KB and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchGetItem reads a 1.5 KB item and a 6.5 KB item, DynamoDB will calculate the size as 12 KB (4 KB + 8 KB), not 8 KB (1.5 KB + 6.5 KB).
For Query, all items returned are treated as a single read operation. As a result, DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
You should use BatchGetItem if you need to retrieve many items with little HTTP overhead when compared to GetItem.
A BatchGetItem costs the same as calling GetItem for each individual item. However, it can be faster since you are making fewer network requests.
In a nutshell:
BatchGetItem works on tables and uses the full primary key to identify the items you want to retrieve. You can get up to 16 MB or 100 items in a response.
Query works on tables, local secondary indexes, and global secondary indexes. You can get at most 1 MB of data in a response. The biggest difference is that Query supports filter expressions, which means you can request data and DDB will filter it server-side for you.
You can probably achieve the same thing with either if you really want to, but the rule of thumb is: do a BatchGet when you need to bulk-dump stuff from DDB, and Query when you need to narrow down what you want to retrieve (and you want Dynamo to do the heavy lifting of filtering the data for you).
DynamoDB stores values under two kinds of keys: a single key, called a partition key, like "jupiter"; or a compound partition-and-range key, like "jupiter"/"planetInfo", "jupiter"/"moon001" and "jupiter"/"moon002".
A BatchGet helps you fetch the values for a large number of keys at the same time. This assumes that you know the full key(s) for each item you want to fetch. So you can do a BatchGet("jupiter", "saturn", "neptune") if you have only partition keys, or BatchGet(["jupiter","planetInfo"], ["saturn","planetInfo"], ["neptune","planetInfo"]) if you're using partition + range keys. Each item is charged independently and the cost is the same as individual gets; it's just that the results are batched and the call saves time (not money).
A Query, on the other hand, works only inside a partition + range key combo and helps you find items and keys that you don't necessarily know. If you wanted to count Jupiter's moons, you'd do a Query(select(COUNT), partitionKey: "jupiter", rangeKeyCondition: "startsWith:'moon'"). Or if you wanted to fetch moons no. 7 to 15, you'd do Query(select(ALL), partitionKey: "jupiter", rangeKeyCondition: "BETWEEN:'moon007'-'moon015'"). Here you're charged based on the size of the data items read by the query, irrespective of how many there are.
Adding an important difference: Query supports consistent reads, while BatchGetItem does not.
Update: BatchGetItem can use consistent reads through TableKeysAndAttributes. Thanks @colmlg for the information.
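For completeness, a small sketch of a consistent BatchGetItem via the Document API's TableKeysAndAttributes; the table and key names are made up:

DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());

TableKeysAndAttributes keys = new TableKeysAndAttributes("Planets") // hypothetical table
        .addHashOnlyPrimaryKeys("name", "jupiter", "saturn", "neptune")
        .withConsistentRead(true); // opt into strongly consistent reads, per table

BatchGetItemOutcome outcome = dynamoDB.batchGetItem(keys);
for (Item item : outcome.getTableItems().get("Planets")) {
    System.out.println(item.toJSONPretty());
}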

AWS DynamoDB Query Call (with no results) Cost

I'm currently working out whether I will use DynamoDB for part of a project. I want to know: if I execute a query against the database for a specific key and it isn't found (e.g., checking whether a UserID is present and getting its contents if it is), is a query that returns no results considered a read, and is it chargeable?
I expect I will do a certain number of queries that won't return results (polling for information) and need to factor this in.
Below is from the AWS website: http://aws.amazon.com/dynamodb/pricing/
"Note that the required number of units of Read Capacity is determined by the number of items being read per second, not the number of API calls. For example, if you need to read 500 items per second from your table, and if your items are 1KB or less, then you need 500 units of Read Capacity. It doesn’t matter if you do 500 individual GetItem calls or 50 BatchGetItem calls that each return 10 items."
You can simply check it by calling DynamoDB and looking at ConsumedCapacityUnits in the result.
For example, if you call a simple GetItem for an item that exists, you get something like:
Result: {Item: {fans={SS: [Billy Bob, James], }, name={S: Airplane, }, year={N: 1980, }, rating={S: *****, }},
ConsumedCapacityUnits: 0.5, }
However, when you are calling it on an item that doesn't exist, you get:
Result: {ConsumedCapacityUnits: 0.5, }
Therefore, it appears that you consume capacity even if the item is not in the table, as the lookup runs regardless.
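With the current SDK the field is called ConsumedCapacity rather than ConsumedCapacityUnits, and you have to request it explicitly. A sketch, where the table and key are placeholders and client is an AmazonDynamoDB instance:

GetItemResult result = client.getItem(new GetItemRequest()
        .withTableName("Users") // hypothetical table
        .addKeyEntry("userId", new AttributeValue("no-such-user"))
        .withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL));

System.out.println("Item: " + result.getItem()); // null when the key is absent
System.out.println("Consumed RCUs: "
        + result.getConsumedCapacity().getCapacityUnits()); // still 0.5 on a miss, per the observation above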
According to this link, there's a difference between Scan and Query operations: a query with no results will not incur any cost.
To provide a more specific answer, the query operation does get billed even if you don't get any results back (the accepted answer covers the getItem operation).
According to the current documentation, all items returned are treated as a single read operation, where DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. Therefore, your 0 KB result would be rounded up to 4 KB and be billed as 0.5 or 1 read capacity unit, depending on your billing plan.