Filter Dynamo DB rows - amazon-web-services

I want to find all DynamoDB rows where two columns have the same value:
table = client.Table('XXX')
response = table.query(
KeyConditionExpression=Key('column1').eq(KeyConditionExpression=Key('column2'))
)
This is wrong, as we can't pass KeyConditionExpression inside the eq call. I don't want to scan all the rows and filter them afterwards.
I searched through multiple resources and answers, but every one of them covers comparing columns against a fixed value, not comparing one column with another.
Is there any way we can achieve this?

Yes, this is possible.
If you want to query over all records you need to use scan; if you only want records with one specific partition key you can use query.
Both accept a FilterExpression, which filters the records after they are read from the table but before they are returned to the caller (so beware: a scan with a filter still reads every record in the table).
A scan from the CLI could look like this:
aws dynamodb scan \
--table-name your-table \
--filter-expression "#i = #j" \
--expression-attribute-names '{"#i": "column1", "#j": "column2"}'
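For reference, the same scan could be written with boto3 roughly as follows. This is only a sketch: it assumes `table` is a boto3 Table resource (for example `boto3.resource("dynamodb").Table("your-table")`), and the helper name `scan_equal_columns` is made up for illustration. It follows `LastEvaluatedKey` because a single scan call returns at most one page of results.

```python
def scan_equal_columns(table, first="column1", second="column2"):
    """Return every item whose two attributes hold the same value.

    `table` is assumed to be a boto3 Table resource (or anything with a
    compatible scan() method). The whole table is still read.
    """
    kwargs = {
        "FilterExpression": "#i = #j",
        "ExpressionAttributeNames": {"#i": first, "#j": second},
    }
    items = []
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Continue from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
```

The filter is applied per page, so some pages may come back empty even though later pages contain matches.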

Create a Global Secondary Index with a partition key of 'Column1Value#Column2Value'
Then it's simply a matter of querying the GSI.
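One way to read this suggestion, sketched in Python: store a derived attribute on every item and make it the partition key of the GSI. The attribute name `gsi_pk`, the index name `column1-column2-index` and the helper names are assumptions for illustration only. Note that this finds items where both columns equal a value you already know; it does not list every matching pair across all values.

```python
SEPARATOR = "#"


def with_composite_key(item, first="column1", second="column2", key_attr="gsi_pk"):
    """Return a copy of the item with the derived 'value1#value2' attribute."""
    composite = f"{item[first]}{SEPARATOR}{item[second]}"
    return {**item, key_attr: composite}


def matching_pair_query(value, index_name="column1-column2-index", key_attr="gsi_pk"):
    """Build query arguments for items where both columns equal `value`."""
    return {
        "IndexName": index_name,
        "KeyConditionExpression": "#k = :k",
        "ExpressionAttributeNames": {"#k": key_attr},
        "ExpressionAttributeValues": {":k": f"{value}{SEPARATOR}{value}"},
    }
```

With a boto3 Table resource this would be used as `table.put_item(Item=with_composite_key(item))` on write and `table.query(**matching_pair_query("abc"))` on read.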

Related

Querying a Global Secondary Index of a DynamoDB table without using the partition key

I have a DynamoDB table with partition key as userID and no sort key.
The table also has a timestamp attribute in each item. I wanted to retrieve all the items having a timestamp in the specified range (regardless of userID i.e. ranging across all partitions).
After reading the docs and searching Stack Overflow, I found that I need to create a GSI for my table.
Hence, I created a GSI with the following keys:
Partition Key: userID
Sort Key: timestamp
I am querying the index with Java SDK using the following code:
String lastWeekDateString = getLastWeekDateString();
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable("user table");
Index index = table.getIndex("userID-timestamp-index");
QuerySpec querySpec = new QuerySpec()
.withKeyConditionExpression("timestamp > :v_timestampLowerBound")
.withValueMap(new ValueMap()
.withString(":v_timestampLowerBound", lastWeekDateString));
ItemCollection<QueryOutcome> items = index.query(querySpec);
Iterator<Item> iter = items.iterator();
while (iter.hasNext()) {
Item item = iter.next();
// extract item attributes here
}
I am getting the following error on executing this code:
Query condition missed key schema element: userID
From what I know, I should be able to query the GSI using only the sort key without giving any condition on the partition key. Please help me understand what is wrong with my implementation. Thanks.
Edit: After reading the thread here, it turns out that we cannot query a GSI with only a range on the sort key. So, what is the alternative, if any, to query the entire table by a range query on an attribute? One suggestion I found in that thread was to use year as the partition key. This will require multiple queries if the desired range spans multiple years. Also, this does not distribute the data uniformly across all partitions, since only the partition corresponding to the current year will be used for insertions for one full year. Please suggest any alternatives.
When using the DynamoDB Query operation, you must specify at least the partition key. This is why you get the error that userID is required. From the AWS Query docs:
The condition must perform an equality test on a single partition key value.
The only way to get items without the partition key is a Scan operation (but the results won't be sorted by your sort key!).
If you want to get all the items sorted, you would have to create a GSI with a partition key that is the same for all the items you need (e.g. create a new attribute on all items, such as "type": "item"). You can then query the GSI and specify #type = :item:
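// NameMap lives in com.amazonaws.services.dynamodbv2.document.utils, next to ValueMap
QuerySpec querySpec = new QuerySpec()
// "timestamp" is a DynamoDB reserved word, so it is aliased as #ts
.withKeyConditionExpression("#type = :item AND #ts > :v_timestampLowerBound")
.withNameMap(new NameMap()
.with("#type", "type")
.with("#ts", "timestamp"))
.withValueMap(new ValueMap()
.withString(":v_timestampLowerBound", lastWeekDateString)
.withString(":item", "item"));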
A good solution for any customised querying requirement with DynamoDB is to design the right primary key schema for a GSI.
When designing a DynamoDB primary key, the main principle is that the hash key should partition the whole set of items, and the sort key should order the items within each partition.
Having said that, I recommend using the year of the timestamp as the hash key and the month-date as the sort key.
With this schema you need at most two queries, as long as the range spans no more than two calendar years.
You are right that you should avoid filtering or scanning as much as you can.
For example, if the start date and the end date fall in the same year, you need only one query. A key condition allows only one condition on the sort key, so use BETWEEN (which includes both bounds), and note that placeholder names cannot contain hyphens:
.withKeyConditionExpression("#year = :year and #monthDate between :startMonthDate and :endMonthDate")
Otherwise you need two queries, one for the start year:
.withKeyConditionExpression("#year = :startYear and #monthDate > :startMonthDate")
and one for the end year:
.withKeyConditionExpression("#year = :endYear and #monthDate < :endMonthDate")
Finally, you should union the result sets from both queries.
This takes at most two Query requests; the read capacity consumed depends on how much data each query reads.
To make sort key comparison easier, you might want to store the timestamp as a UNIX timestamp.
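To make the year-splitting concrete, here is a small Python sketch that turns a date range into one set of query arguments per calendar year. The attribute names `year` and `monthDate`, the zero-padded `MM-DD` sort key format and the function name are assumptions, not something from the question. It uses BETWEEN, so both ends of the range are included.

```python
from datetime import date


def yearly_key_conditions(start: date, end: date):
    """Split [start, end] into one set of query arguments per calendar year."""
    if start > end:
        raise ValueError("start must not be after end")
    conditions = []
    for year in range(start.year, end.year + 1):
        # Clamp the range to the part that falls inside this year.
        low = start if year == start.year else date(year, 1, 1)
        high = end if year == end.year else date(year, 12, 31)
        conditions.append({
            "KeyConditionExpression": "#y = :y AND #md BETWEEN :lo AND :hi",
            "ExpressionAttributeNames": {"#y": "year", "#md": "monthDate"},
            "ExpressionAttributeValues": {
                ":y": str(year),
                # Zero-padded MM-DD strings sort in calendar order.
                ":lo": low.strftime("%m-%d"),
                ":hi": high.strftime("%m-%d"),
            },
        })
    return conditions
```

Each returned dict could be passed, together with an `IndexName`, to `table.query(**kwargs)` on a boto3 Table resource, and the pages from all years concatenated.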
Thanks

DynamoDb CLI query across multiple ids

We are processing a huge file by splitting the file to multiple parts of 200 rows each (storing them in a S3 bucket and processing each file). Each part file has an ID (Partition Key) and the Timestamp is the Sort Key.
I'm looking to find the total count (across multiple IDs/part files) by different statuses (SUCCESS, FAILURE). For example:
200000 records were successful (Status=Success) within the last 4 hours.
200 records failed (Status=Failure) due to errorStatus "FAILURE :: Could not open JDBC Connection" within the past 4 hours.
158 records failed (Status=Failure) due to errorStatus "FAILURE :: Network failed" within the past 4 hours.
I'm able to get them for each BId separately. For example:
aws dynamodb query --table-name abc1 --index-name abcGdx1 --projection-expression "TId" --key-condition-expression "BId = :bId and STimestamp between :sortkeyval1 and :sortkeyval2" --filter-expression "PStatus = :status and PStage = :stage" --expression-attribute-values "{\":bId\": {\"S\": \"c1234-5678-1000\"}, \":stage\": {\"S\": \"C_C\"}, \":status\": {\"S\": \"SUCCESS\"}, \":sortkeyval1\": {\"S\": \"2020-09-22T22:00:42.108-04:00\"}, \":sortkeyval2\": {\"S\": \"2020-09-23T18:52:55.724-04:00\"}}" --return-consumed-capacity TOTAL
Can you please help with an idea of how this can be achieved?
It sounds like the status field is an attribute in your table, and not part of any primary key. If that's the case, you will not be able to use the query operation to search by status, since the query operation requires you to know the partition key of the items you're looking for (which is what your current solution does).
You have two options:
Perform a scan operation across your entire table for each status you care about. Unlike the query operation, scan lets you search the entire table. It's commonly considered an operation of last resort, as it is slow and expensive compared to query operations. If you were to go this route, the CLI command would look like this:
aws dynamodb scan \
--table-name abc1 \
--filter-expression "#status = :status" \
--expression-attribute-names '{"#status": "PStatus"}' \
--expression-attribute-values '{":status": {"S": "SUCCESS"}}'
Create a secondary index with the status field as your partition key. This will allow you to perform a fast query operation on all items with a given status.
For example, let's assume you have a table that looks something like this:
If you create a secondary index on the status field, your table would logically look like this:
Keep in mind that this is the same data as the first screenshot, just viewed from the perspective of the secondary index. Using this secondary index, you could issue a query operation to fetch all items with a given status:
aws dynamodb query \
--table-name abc1 \
--index-name <YOUR STATUS INDEX NAME HERE> \
--key-condition-expression "#pk = :pk" \
--expression-attribute-names '{"#pk": "PStatus"}' \
--expression-attribute-values '{":pk": {"S":"SUCCESS"}}'
The main difference between the two approaches is the scan vs. query operation. A scan operation needs to look at your entire table to find what you are looking for, which is inefficient. The query operation looks up a specific partition key, which is much faster.
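If only the totals are needed, a query against such a status index can ask DynamoDB for counts instead of items. The following Python sketch assumes `table` is a boto3 Table resource and that an index with PStatus as its partition key exists; the function name and the index name passed in are hypothetical. It sums `Count` over all pages, since one query call covers at most one page.

```python
def count_by_status(table, index_name, statuses, status_attr="PStatus"):
    """Return {status: number of items} using COUNT queries on a status index."""
    totals = {}
    for status in statuses:
        kwargs = {
            "IndexName": index_name,
            "KeyConditionExpression": "#pk = :pk",
            "ExpressionAttributeNames": {"#pk": status_attr},
            "ExpressionAttributeValues": {":pk": status},
            "Select": "COUNT",  # return only the count, not the items
        }
        total = 0
        while True:
            page = table.query(**kwargs)
            total += page["Count"]
            last_key = page.get("LastEvaluatedKey")
            if not last_key:
                break
            kwargs["ExclusiveStartKey"] = last_key
        totals[status] = total
    return totals
```

For example, `count_by_status(table, "status-index", ["SUCCESS", "FAILURE"])`. Restricting the counts to the last 4 hours would additionally need the timestamp as the sort key of the index, or a filter expression.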

How should the Query's format be structured for sending a call with 'Greater than' condition in AWS DynamoDB?

I wanted to run a greater-than query against the primary key of my table. Later I learned that greater-than conditions can only be applied to sort keys, not to partition keys. So I have now redesigned my table, and here's a screenshot of the new one (StoreID is the partition key and OrderID is the sort key):
How should I format the Query, if I want to run a query like return those items whose 'OrderID' > 1005?
More particularly, what should I mention in the Query condition to meet my requirements?
Thanks a lot!
You can use the following CLI command to run query "return those items in store with storeid='STR100' whose 'OrderID' > 1005".
aws dynamodb query --table-name <table-name> --key-condition-expression "StoreID = :v1 AND OrderID > :v2" --expression-attribute-values '{":v1": {"S": "STR100"}, ":v2": {"N": "1005"}}'
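The same request can be built for the low-level boto3 client. This is a sketch that assumes OrderID is stored as a Number and StoreID as a String; the function name and the table name in the usage line are made up. In the low-level format a number is always sent as a string inside the "N" wrapper.

```python
def build_order_query(table_name, store_id, min_order_id):
    """Build arguments for a low-level client query: one store, OrderID > bound."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "StoreID = :v1 AND OrderID > :v2",
        "ExpressionAttributeValues": {
            ":v1": {"S": store_id},
            # Numbers travel as strings in DynamoDB's wire format.
            ":v2": {"N": str(min_order_id)},
        },
    }
```

Usage would look like `boto3.client("dynamodb").query(**build_order_query("orders", "STR100", 1005))`.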

Find number of objects inside an Item of DynamoDB table using Lambda function (Python/Node)

I am new to the AWS world and I am in need to find the data count from a DynamoDB table.
My table structure is like this.
It has 2 items (Columns in MySQL) say A and B
A - stores the (primary partition key) user ids.
B - stores the user profiles, number of profiles associated with a UserID.
Suppose A contains a user ID 3435 and it has 3 profiles ({"21btet3","3sd4","adf11"})
My requirement is to get the count 3 to the output as a JSON in the format :
How to set the parameters for scanning this query?
Can anyone please help?
DynamoDB is NoSQL, so there are some limitations in terms of querying the data. In your case you have to scan the entire table, like below:
import boto3

def ScanDynamoData(lastEvalutedKey):
    # Add your region and table name
    table = boto3.resource("dynamodb", "eu-west-1").Table('TableName')
    if lastEvalutedKey:
        # Resume the scan from where the previous page stopped
        return table.scan(ExclusiveStartKey=lastEvalutedKey)
    return table.scan()
Call this method in a loop until the response no longer contains LastEvaluatedKey (to scan all the records), like this:
response = ScanDynamoData(None)
totalUserIds = response["Count"]
# Each response holds one page of the table in response["Items"]; count user IDs and profiles here
while "LastEvaluatedKey" in response:
    response = ScanDynamoData(response["LastEvaluatedKey"])
    totalUserIds += response["Count"]
    # Add the counts from this page here as well
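If the count is only needed for one known user ID, a scan may not be necessary: since A is the partition key, a single GetItem can fetch that item and the profiles can be counted in code. This is a different technique from the scan above, sketched under the assumption that `table` is a boto3 Table resource and that B holds a set or list of profiles; the function name is hypothetical.

```python
def profile_count(table, user_id, key_attr="A", profiles_attr="B"):
    """Return how many profiles one user has, or None if the user is missing."""
    response = table.get_item(Key={key_attr: user_id})
    item = response.get("Item")
    if item is None:
        return None
    # The resource API returns DynamoDB sets and lists as Python sets and lists.
    return len(item.get(profiles_attr, ()))
```

The result can then be wrapped in whatever JSON shape the Lambda function has to return.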
You should not do a full table scan on a regular basis.
If your requirement is to get this count frequently, you should subscribe a Lambda function to DynamoDB Streams and update the count as new records are inserted into DynamoDB. This will make sure that:
you are paying less
you do not have to scan the table to calculate this number.
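A minimal sketch of such a stream-triggered function in Python is shown below. It assumes the counter lives in a separate item that is updated with an atomic ADD; the attribute name `itemCount`, the counter key and the factory function `make_handler` are all made up for illustration, and `counter_table` is assumed to be a boto3 Table resource. Only INSERT events are counted here; deletions (REMOVE events) would need a matching decrement.

```python
def count_inserts(event):
    """Count the INSERT records in one DynamoDB Streams batch."""
    return sum(
        1 for record in event.get("Records", [])
        if record.get("eventName") == "INSERT"
    )


def make_handler(counter_table, counter_key):
    """Build a Lambda handler that adds each batch's inserts to a counter item."""
    def handler(event, context):
        inserted = count_inserts(event)
        if inserted:
            counter_table.update_item(
                Key=counter_key,
                # ADD creates the attribute if it does not exist yet.
                UpdateExpression="ADD #n :inc",
                ExpressionAttributeNames={"#n": "itemCount"},
                ExpressionAttributeValues={":inc": inserted},
            )
        return {"inserted": inserted}
    return handler
```

Reading the current count is then a single GetItem on the counter item instead of a table scan.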

How can I check the partition list from Athena in AWS?

I want to check the partition lists in Athena.
I used query like this.
show partitions table_name
But I want to check whether a specific partition exists.
So I used a query like the one below, but no results were returned.
show partitions table_name partition(dt='2010-03-03')
This is because dt also contains the hour:
dt='2010-03-03-01', dt='2010-03-03-02', ...........
So is there any way to input '2010-03-03' and have it match '2010-03-03-01', '2010-03-03-02', and so on?
Or do I have to split the partition like this?
dt='2010-03-03', dh='01'
Also, show partitions table_name returned only 500 rows in Hive. Is it the same in Athena?
In Athena v2:
Use this SQL:
SELECT dt
FROM db_name."table_name$partitions"
WHERE dt LIKE '2010-03-03-%'
(see the official aws docs)
In Athena v1:
There is a way to return the partition list as a resultset, so this can be filtered using LIKE. But you need to use the internal information_schema database like this:
SELECT partition_value
FROM information_schema.__internal_partitions__
WHERE table_schema = '<DB_NAME>'
AND table_name = '<TABLE_NAME>'
AND partition_value LIKE '2010-03-03-%'