DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it
returns in the result. To do this, set the Limit parameter to the
maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and
without a filter expression. The Query result will contain the first
six items from the table that match the key condition expression from
the request.
Now suppose you add a filter expression to the Query. In this case,
DynamoDB will apply the filter expression to the six items that were
returned, discarding those that do not match. The final Query result
will contain 6 items or fewer, depending on the number of items that
were filtered.
Based on this, it looks like the following query should (at least sometimes) return 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec {
    hashKey: null,
    rangeKeyCondition: null,
    queryFilters: null,
    nameMap: {"#0" -> "UserId", "#1" -> "ActiveLogin"},
    valueMap: {":0" -> "X", ":1" -> "true"},
    exclusiveStartKey: null,
    maxPageSize: null,
    maxResultSize: 10,
    req: {
        TableName: UserLogins,
        ConsistentRead: true,
        ReturnConsumedCapacity: TOTAL,
        FilterExpression: #1 = :1,
        KeyConditionExpression: #0 = :0,
        ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},
        ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}
    }
}
I always get 1 row: the 1 active login for UserId = X. And it's not happening for just one user; it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction because maxResultSize=10 means that DynamoDB will only read the first 10 items (out of 10,001) and then apply the filter ActiveLogin=true (which might return 0 results). It seems very unlikely that the record with ActiveLogin=true happened to be in the first 10 records that DynamoDB read.
This is happening for hundreds of customers running similar queries. It works great, even though, according to the documentation, it shouldn't.

I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That seems to mean that if your user logs in with the same device, it would overwrite the existing item. Or put another way, I think you are saying your users have 10,000 different devices each (unless the DeviceId rotates in some way).
In your shoes I would just remove the filter expression and print the results to the log to see what you're getting in your 10 results (see the sketch below). Then remove the limit too and see what results you get with that.
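A minimal sketch of that debugging step, shown with boto3 rather than the Java document API (the wire-level Query request is the same; table and attribute names are taken from the question). One hedged observation: if the Java document API's result iterator auto-paginates until maxResultSize items have been accumulated, then the server-side Limit described in the documentation applies per page rather than to the whole query, which alone would explain why the active login is always found.

import boto3

client = boto3.client("dynamodb")

# Same query as above, but with the filter expression removed so the raw
# first page is visible. Region/credentials are assumed to be configured.
resp = client.query(
    TableName="UserLogins",
    KeyConditionExpression="#0 = :0",
    ExpressionAttributeNames={"#0": "UserId"},
    ExpressionAttributeValues={":0": {"S": "X"}},
    ConsistentRead=True,
    Limit=10,  # the per-request page size the documentation describes
)
for item in resp["Items"]:
    print(item)
print("LastEvaluatedKey:", resp.get("LastEvaluatedKey"))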

Related

AWS Cloudwatch Log Insights: Aggregate results are impossible (count - count_distinct is negative)

I'm running a CloudWatch log insights query on a single log stream that corresponds to a single Python AWS Lambda function. This function logs a unique line corresponding to the key in s3 that it is processing. It logs this line once at the beginning of the invocation. The only condition where it won't log this line is if it fails before it even reads the event.
The query is:
parse @message /(?<@unique_key>Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+)/
| filter @message like /Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+/
| stats count(@unique_key) - count_distinct(@unique_key) as @distinct_unique_keys_delta
  by datefloor(@timestamp, 1d) as @_datefloor
| sort @_datefloor asc
The two regular expressions in this query will parse the full key of the s3 file being processed. In this particular problem and in general, my understanding is that the count(...) of any quantity minus the count_distinct(...) of the same quantity should always be greater than or equal to zero.
For several of the days in the results, it is a negative number.
I thought I might be misunderstanding the correct usage of datefloor(), so I tried running the following query:
parse @message /(?<@unique_key>Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+)/
| filter @message like /Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+/
| stats count(@unique_key) - count_distinct(@unique_key) as @distinct_unique_keys_delta
The result was -20,347.
At this point the only scenarios I can see are:
1. Something wrong with the code executing the query.
2. I'm misunderstanding this tool.
I have discovered that the count_distinct function in AWS Logs Insights queries doesn't really return a distinct count! As per the documentation:
Returns the number of unique values for the field. If the field has very high cardinality (contains many unique values), the value returned by count_distinct is just an approximation.
Apparently I can't just assume that a function returns an accurate result.
The documentation page.
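If an exact number matters, one workaround is to pull the matching messages out of Logs Insights and compute the distinct count client-side. A minimal boto3 sketch, assuming a hypothetical log group name, a simplified key pattern, and a result set under the 10,000-row query cap:

import re
import time
import boto3

logs = boto3.client("logs")

# Fetch the raw matching messages instead of aggregating server-side.
query_id = logs.start_query(
    logGroupName="/aws/lambda/my-function",  # hypothetical log group
    startTime=int(time.time()) - 86400,      # last 24 hours (assumption)
    endTime=int(time.time()),
    queryString=(
        "fields @message"
        " | filter @message like /Processing key:/"
        " | limit 10000"  # Logs Insights caps results at 10,000 rows
    ),
)["queryId"]

resp = logs.get_query_results(queryId=query_id)
while resp["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    resp = logs.get_query_results(queryId=query_id)

# Exact count minus exact distinct count -- this can never be negative.
keys = []
for row in resp["results"]:
    for field in row:
        if field["field"] == "@message":
            match = re.search(r"Processing key: (\S+)", field["value"])
            if match:
                keys.append(match.group(1))
print(len(keys) - len(set(keys)))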

Go DynamoDB Query returns no item with Filter and Limit=1

I have the following DynamoDB table:

user_id | date                | game
user1   | 2021-12-06 14:36:46 | game1
user1   | 2021-12-06 15:36:46 | game1
user1   | 2021-12-07 11:36:46 | game2
user1   | 2021-12-07 12:36:46 | game2
partition key: user_id
sort key: date
I want to Query the latest entry of the user for game game1 (which is the second item in the table, with date 2021-12-06 15:36:46). I can achieve this in code as follows:
expr, _ := expression.NewBuilder().
    WithKeyCondition(expression.Key("user_id").Equal(expression.Value("user1"))).
    WithFilter(expression.Name("game").Equal(expression.Value("game1"))).
    Build()

var queryInput = &dynamodb.QueryInput{
    KeyConditionExpression:    expr.KeyCondition(),
    ExpressionAttributeNames:  expr.Names(),
    ExpressionAttributeValues: expr.Values(),
    FilterExpression:          expr.Filter(),
    ScanIndexForward:          aws.Bool(false),
    TableName:                 aws.String(table),
}
This returns all items of user user1 for game game1. The problem occurs when I apply Limit: aws.Int64(1) in QueryInput; then it returns nothing. Could someone explain why that is?
When I change it to Limit: aws.Int64(4) (the total number of items in the table), only then does the query return the single expected item. How is this limit working?
Do I need to use game as a GSI?
The Limit parameter on a DynamoDB Query is applied before your filter expressions.
Essentially, with a limit of 1, it retrieves 1 record, then applies the filters and returns the items that match (0).
See https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Limit for more details, copied below in case the link breaks:
Limiting the Number of Items in the Result Set
The Query operation allows you to limit the number of items that it reads. To do this, set the Limit parameter to the maximum number of items that you want.
For example, suppose that you Query a table, with a Limit value of 6, and without a filter expression. The Query result contains the first six items from the table that match the key condition expression from the request.
Now suppose that you add a filter expression to the Query. In this case, DynamoDB reads up to six items, and then returns only those that match the filter expression. The final Query result contains six items or fewer, even if more items would have matched the filter expression if DynamoDB had kept reading more items.
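To keep Limit small and still find the match, you have to paginate: follow LastEvaluatedKey until the filter lets an item through. A rough sketch of that loop, shown with boto3 for brevity (the Go SDK flow is identical: feed QueryOutput.LastEvaluatedKey back in as ExclusiveStartKey):

import boto3

client = boto3.client("dynamodb")

def latest_entry(table, user_id, game):
    # Page newest-first, one item per request, until the server-side
    # filter lets an item through or the partition is exhausted.
    kwargs = {
        "TableName": table,
        "KeyConditionExpression": "user_id = :u",
        "FilterExpression": "#g = :g",
        "ExpressionAttributeNames": {"#g": "game"},
        "ExpressionAttributeValues": {":u": {"S": user_id}, ":g": {"S": game}},
        "ScanIndexForward": False,  # descending by the date sort key
        "Limit": 1,
    }
    while True:
        resp = client.query(**kwargs)
        if resp["Items"]:
            return resp["Items"][0]  # first match is the latest entry
        if "LastEvaluatedKey" not in resp:
            return None  # no item matched the filter
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

With the sample data above, this issues three single-item requests before the 2021-12-06 15:36:46 item makes it through the filter.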

Get latest 3 entries from DynamoDb

I have a DynamoDB table with the following schema:
{
    "id": String [hash key]
    "type": String [range key]
}
I have a use case where I need to fetch the last 3 rows for a given id when the type is unknown.
Your items need a timestamp attribute. Without that they can't be sorted or filtered by time. Once you have that, you can define a local secondary index with the id as partition key and the timestamp as the sort key. (Note that a local secondary index can only be created together with the table; on an existing table you would need a global secondary index instead.) You can then get the top three items from the index, as sketched below.
Find more information about DynamoDb’s Local Secondary Index here.
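A rough sketch of that query, assuming hypothetical names "my-table" for the table and "id-timestamp-index" for the index (id as partition key, timestamp as sort key):

import boto3

client = boto3.client("dynamodb")

resp = client.query(
    TableName="my-table",            # hypothetical table name
    IndexName="id-timestamp-index",  # hypothetical index name
    KeyConditionExpression="#id = :id",
    ExpressionAttributeNames={"#id": "id"},
    ExpressionAttributeValues={":id": {"S": "some-id"}},
    ScanIndexForward=False,  # newest first
    Limit=3,                 # the last 3 rows
)
print(resp["Items"])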
1. Add a field to store the timestamp to the schema.
2. Use Query to fetch all the records for the given key.
3. Query always returns records sorted by the range key; you cannot sort by a different attribute (without changing the table's schema), so sort the records by timestamp in your code.
4. Take the top 3 records.
If you have a lot of records, use filter expressions to drop extra results. E.g. if you know that the latest records will always have a timestamp not older than an hour (day, week, or so), you can filter out older records.

Best method to extract data from dynamoDb and move it to another table

I have a table of 500 GB. I want to transfer the data to another table based on the timestamps.
There are several items in the table, and I want only the latest entry of every item in the other table.
Considering the size of the table, can anyone recommend the best AWS service to get this done fast and easily?
I have come across AWS Glue and HiveCopyActivity. Are these the best solutions, or is there another service I can use?
(Assuming you can still add a global secondary index (GSI) on that table, that is: you currently have < 5 GSIs.)
Define a new GSI on your table. The GSI's partition key will be x. The GSI's sort key will be timestamp. Once you have that GSI defined, you can do a query on that index with ScanIndexForward set to false to get the most recent item first. You need to supply the value of x you are interested in. In the following example request it is simply set to 'abc':
{
    "TableName": "<your-table-name>",
    "IndexName": "<your-GSI-name>",
    "KeyConditionExpression": "x = :argx",
    "ExpressionAttributeValues": {
        ":argx": {"S": "abc"}
    },
    "ScanIndexForward": false,
    "Limit": 1
}
This query looks at items with a given x value (as set in the ExpressionAttributeValues field), sorted in descending order by the GSI's sort key (the timestamp field), and picks the first one (Limit is set to 1). As long as you do not need filtering (the FilterExpression field is empty), you will get the result you need by issuing a single Query request.
If you do want to use filtering, you will need to do multiple requests and unset the Limit field (i.e., use its default value). See this answer for further details on those subtleties.
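For that filtered variant, a rough sketch using boto3's query paginator, with the same placeholder names as above and a purely illustrative attribute_exists condition; Limit stays unset so each request uses the default page size:

import boto3

client = boto3.client("dynamodb")
paginator = client.get_paginator("query")

# Pages arrive newest-first; the first page where the filter lets anything
# through has the most recent matching item at position 0.
for page in paginator.paginate(
    TableName="<your-table-name>",
    IndexName="<your-GSI-name>",
    KeyConditionExpression="x = :argx",
    ExpressionAttributeValues={":argx": {"S": "abc"}},
    FilterExpression="attribute_exists(payload)",  # hypothetical filter
    ScanIndexForward=False,
):
    if page["Items"]:
        print(page["Items"][0])
        break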

dynamodb - scan items where map contains a key

I have a table that contains a field (not a key field), called appsMap, and it looks like this:
appsMap = { "qa-app": "abc", "another-app": "xyz" }
I want to scan all rows whose appsMap contains the key "qa-app" (the value is not important, just the key). I tried something like this but it doesn't work in the way I need:
FilterExpression = '#appsMap.#app <> :v',
ExpressionAttributeNames = {
    "#app": "qa-app",
    "#appsMap": "appsMap"
},
ExpressionAttributeValues = {
    ":v": { "NULL": True }
},
ProjectionExpression = "deviceID"
What's the correct syntax?
Thanks.
There is a discussion on the subject here:
https://forums.aws.amazon.com/thread.jspa?threadID=164470
You might be missing this part from the example:
ExpressionAttributeValues: {":name":{"S":"Jeff"}}
However, I just wanted to echo what was already said: Scan is an expensive operation that goes through every item, which makes your database hard to scale.
Unlike with other databases, you have to do plenty of setup with DynamoDB to get it to perform at its best. Here is a suggestion:
1) Convert this into a root-level value: for example, add qaExist to the root, with possible values of 0|1 or true|false.
2) Create a secondary index for the newly created value.
3) Make a query on the new index, specifying 0 as a search parameter.
This will make your system very fast and very scalable regardless of how many records you get in there later on.
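A rough sketch of step 3, with hypothetical table and index names. One caveat: an index key attribute must be a string, number, or binary, so the numeric 0|1 form of the flag is the one that can back the index (a BOOL attribute cannot be an index key):

import boto3

client = boto3.client("dynamodb")

resp = client.query(
    TableName="my-table",       # hypothetical table name
    IndexName="qaExist-index",  # hypothetical index on the new flag
    KeyConditionExpression="qaExist = :v",
    ExpressionAttributeValues={":v": {"N": "0"}},  # the flag value to match
    ProjectionExpression="deviceID",
)
print(resp["Items"])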
If I understand the question correctly, you can do the following:
FilterExpression = 'attribute_exists(#0.#1)',
ExpressionAttributeNames = {
    "#0": "appsMap",
    "#1": "qa-app"
},
ProjectionExpression = "deviceID"
Since you're being a bit vague about your expectations and what's happening ("I tried something like this but it doesn't work in the way I need"), I'd like to mention that a scan with a filter is very different from a query.
Filters are applied on the server, but only after the scan is executed. That means the scan still iterates over all the data in your table; instead of returning you every item, it applies the filter to each response, saving you some network bandwidth but potentially returning empty results as you page through your entire table.
You could look into creating a GSI on the table if this is a query you expect to have to run often.
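For completeness, a paged version of the attribute_exists scan from the earlier answer; the table name is a placeholder:

import boto3

client = boto3.client("dynamodb")

device_ids = []
kwargs = {
    "TableName": "my-table",  # hypothetical table name
    "FilterExpression": "attribute_exists(#0.#1)",
    "ExpressionAttributeNames": {"#0": "appsMap", "#1": "qa-app"},
    "ProjectionExpression": "deviceID",
}
# Individual pages may come back empty even though later pages match,
# so keep going until DynamoDB stops returning a LastEvaluatedKey.
while True:
    resp = client.scan(**kwargs)
    device_ids.extend(item["deviceID"]["S"] for item in resp["Items"])
    if "LastEvaluatedKey" not in resp:
        break
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
print(device_ids)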