DynamoDB CLI query across multiple IDs - amazon-web-services

We are processing a huge file by splitting it into multiple parts of 200 rows each (storing them in an S3 bucket and processing each file). Each part file has an ID (the partition key), and the timestamp is the sort key.
I'm looking to find the total count (across multiple IDs/part files) by different statuses (SUCCESS, FAILURE). For example:
200000 records were successful (Status=Success) within the last 4 hours.
200 records failed (Status=Failure) with errorStatus "FAILURE :: Could not open JDBC Connection" within the past 4 hours.
158 records failed (Status=Failure) with errorStatus "FAILURE :: Network failed" within the past 4 hours.
I'm able to get the counts for each BId separately. For example:
aws dynamodb query \
--table-name abc1 \
--index-name abcGdx1 \
--projection-expression "TId" \
--key-condition-expression "BId = :bId AND STimestamp BETWEEN :sortkeyval1 AND :sortkeyval2" \
--filter-expression "PStatus = :status AND PStage = :stage" \
--expression-attribute-values '{":bId": {"S": "c1234-5678-1000"}, ":stage": {"S": "C_C"}, ":status": {"S": "SUCCESS"}, ":sortkeyval1": {"S": "2020-09-22T22:00:42.108-04:00"}, ":sortkeyval2": {"S": "2020-09-23T18:52:55.724-04:00"}}' \
--return-consumed-capacity TOTAL
Can you please help with an idea of how this can be achieved?

It sounds like the status field is an attribute in your table, and not part of any primary key. If that's the case, you will not be able to use the query operation, since the query operation requires you to know the Primary Key of the item you're looking for (which sounds like your current solution).
You have one of two options:
Perform a scan operation across your entire table for each status you care about. Unlike the query operation, scan lets you search the entire table. It's commonly considered an operation of last resort, as it is slow and expensive compared to query operations. If you were to go this route, the CLI command would look like this:
aws dynamodb scan \
--table-name abc1 \
--filter-expression "#status = :status" \
--expression-attribute-names '{"#status": "PStatus"}' \
--expression-attribute-values '{":status": {"S": "SUCCESS"}}'
Create a secondary index with the status field as your partition key. This will allow you to perform a fast query operation on all items with a given status.
For example, suppose your table stores these processing records keyed by BId, with PStatus as a plain attribute. If you create a secondary index with PStatus as the partition key, the same items are grouped by status when viewed through the index; it is the same data, just seen from the perspective of the secondary index. Using this secondary index, you could issue a query operation to fetch all items with a given status:
aws dynamodb query \
--table-name abc1 \
--index-name <YOUR STATUS INDEX NAME HERE> \
--key-condition-expression "#pk = :pk" \
--expression-attribute-names '{"#pk": "PStatus"}' \
--expression-attribute-values '{":pk": {"S":"SUCCESS"}}'
The main difference between the two approaches is the scan vs. query operation. A scan operation needs to look at your entire table to find what you are looking for, which is inefficient. The query operation looks up a specific primary key, which is much faster.
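If the goal is only the totals rather than the items themselves, here is a minimal boto3 sketch that sums per-BId counts using the table, index, and attribute names from the question; the list of BIds, the helper name, and the omission of the PStage filter are assumptions for illustration:

import boto3

client = boto3.client("dynamodb")

# Hypothetical helper: counts items for one part-file BId, paginating as needed.
def count_for_bid(bid, status, start, end):
    total = 0
    kwargs = {
        "TableName": "abc1",
        "IndexName": "abcGdx1",
        "Select": "COUNT",  # return only counts, not the items
        "KeyConditionExpression": "BId = :bId AND STimestamp BETWEEN :t1 AND :t2",
        "FilterExpression": "PStatus = :status",
        "ExpressionAttributeValues": {
            ":bId": {"S": bid},
            ":status": {"S": status},
            ":t1": {"S": start},
            ":t2": {"S": end},
        },
    }
    while True:
        response = client.query(**kwargs)
        total += response["Count"]
        if "LastEvaluatedKey" not in response:
            return total
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

# Hypothetical BIds; in practice, list the part files you uploaded to S3.
bids = ["c1234-5678-1000", "c1234-5678-1001"]
total = sum(count_for_bid(b, "SUCCESS",
                          "2020-09-22T22:00:42.108-04:00",
                          "2020-09-23T18:52:55.724-04:00") for b in bids)
print(total)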

Related

How should the Query's format be structured for sending a call with a 'Greater than' condition in AWS DynamoDB?

I wanted to run a greater-than query against the primary key of my table. Later I came to know that greater-than conditions can only be applied to sort keys, not to partition keys. So I have now re-designed my table: StoreID is the partition key, and OrderID is the sort key.
How should I format the query if I want to return those items whose OrderID > 1005?
More specifically, what should I put in the query condition to meet my requirement?
Thanks a lot!
You can use the following CLI command to run the query "return those items in the store with StoreID='STR100' whose OrderID > 1005":
aws dynamodb query \
--table-name <table-name> \
--key-condition-expression "StoreID = :v1 AND OrderID > :v2" \
--expression-attribute-values '{":v1": {"S": "STR100"}, ":v2": {"N": "1005"}}'
Note that number values in the attribute-value JSON must be passed as strings, i.e. {"N": "1005"}, not {"N": 1005}.
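For completeness, a minimal boto3 sketch of the same query (the table name is a placeholder); the resource-level API lets you combine key conditions with &:

import boto3
from boto3.dynamodb.conditions import Key

# Placeholder table name; StoreID and OrderID are the keys from the question.
table = boto3.resource("dynamodb").Table("your-table")

response = table.query(
    KeyConditionExpression=Key("StoreID").eq("STR100") & Key("OrderID").gt(1005)
)
print(response["Items"])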

Filter DynamoDB rows

I want to filter all DynamoDB rows where two columns have the same value:
table = client.Table('XXX')
response = table.query(
    KeyConditionExpression=Key('column1').eq(KeyConditionExpression=Key('column2'))
)
This is wrong, as we can't pass a KeyConditionExpression inside the eq statement. I don't want to scan through all the rows and then filter them.
I have scanned through multiple resources and answers, but every resource talks about checking columns against some value, not about conditions involving multiple columns.
Is there any way we can achieve this?
Yes, this is possible.
If you want to search over all records you need to use scan; if you want to look only at the records with one specific partition key you can use query.
For both you can use a FilterExpression, which filters the records after they are retrieved from the database but before they are returned to you (so beware: a scan with a filter will still read all your records).
A scan from the CLI could look like this:
aws dynamodb scan \
--table-name your-table \
--filter-expression "#i = #j" \
--expression-attribute-names '{"#i": "column1", "#j": "column2"}'
Alternatively, create a Global Secondary Index with a partition key of 'Column1Value#Column2Value'.
Then it's simply a matter of querying the GSI.
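Since the question uses boto3, here is a minimal sketch of the scan approach in Python (the table name is a placeholder); note that Attr can be compared against another Attr, which produces the same "#i = #j" filter as the CLI command above:

import boto3
from boto3.dynamodb.conditions import Attr

# Placeholder table name; remember that a scan with a filter still reads every item.
table = boto3.resource("dynamodb").Table("your-table")

items = []
response = table.scan(FilterExpression=Attr("column1").eq(Attr("column2")))
items.extend(response["Items"])
# Keep paginating until LastEvaluatedKey is absent.
while "LastEvaluatedKey" in response:
    response = table.scan(
        FilterExpression=Attr("column1").eq(Attr("column2")),
        ExclusiveStartKey=response["LastEvaluatedKey"],
    )
    items.extend(response["Items"])
print(items)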

Find the number of objects inside an item of a DynamoDB table using a Lambda function (Python/Node)

I am new to the AWS world and I need to find the data count from a DynamoDB table.
My table structure is like this: it has two attributes (columns in MySQL terms), say A and B.
A stores the user IDs (the primary partition key).
B stores the user profiles, i.e. the profiles associated with a UserID.
Suppose A contains the user ID 3435 and it has 3 profiles ({"21btet3","3sd4","adf11"}).
My requirement is to get the count 3 in the output as JSON.
How do I set the parameters to scan for this?
Can anyone please help?
DynamoDB is NoSQL, so there are some limitations in terms of querying the data. In your case you have to scan the entire table, like below:
import boto3

def ScanDynamoData(lastEvaluatedKey):
    # Add your region and table name
    table = boto3.resource("dynamodb", "eu-west-1").Table("TableName")
    if lastEvaluatedKey:
        return table.scan(ExclusiveStartKey=lastEvaluatedKey)
    return table.scan()
And call this method in a loop until LastEvaluatedKey is absent (to scan all the records), like:
response = ScanDynamoData(None)
totalUserIds = response["Count"]
# The response also contains the page's items, so you can count user ids and profiles here
while "LastEvaluatedKey" in response:
    response = ScanDynamoData(response["LastEvaluatedKey"])
    totalUserIds += response["Count"]
    # Add counts here also
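If you also need the per-user profile counts as JSON, collect response["Items"] from each page and feed them to a helper like this hedged sketch (the helper name is hypothetical; A and B are the attributes from the question):

import json

# Hypothetical helper, assuming A holds the user id and B holds the set of
# profiles, as described in the question. len(B) gives the profile count.
def profile_counts(items):
    return json.dumps({str(item["A"]): len(item.get("B", [])) for item in items})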
You should not do a full table scan on a regular basis.
If your requirement is to get this count frequently, you should subscribe a Lambda function to DynamoDB Streams and update the count as and when new records are inserted into DynamoDB (see the sketch below). This will make sure that you pay less and that you do not have to scan the table to calculate this number.
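A minimal sketch of that streams-based approach, assuming a hypothetical Counters table that stores the running total; the Lambda is subscribed to the source table's stream:

import boto3

dynamodb = boto3.resource("dynamodb")
counters = dynamodb.Table("Counters")  # hypothetical aggregate table

def handler(event, context):
    # Each stream record describes one insert/modify/remove on the source table.
    delta = 0
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            delta += 1
        elif record["eventName"] == "REMOVE":
            delta -= 1
    if delta:
        # Atomically adjust the stored count ("count" is escaped because
        # COUNT is a DynamoDB reserved word).
        counters.update_item(
            Key={"name": "totalUserIds"},
            UpdateExpression="ADD #c :d",
            ExpressionAttributeNames={"#c": "count"},
            ExpressionAttributeValues={":d": delta},
        )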

Best method to extract data from DynamoDB and move it to another table

I have a table of 500 GB. I want to transfer the data to another table based on the timestamps.
There are several items in the table, and I want only the latest entry of every item in the other table.
Considering the size of the table, can anyone recommend the best AWS service to get it done fast and easily?
I have come across AWS Glue and HiveCopyActivity. Are these the best solutions, or is there any other service I can use?
(assuming you can now add a Global Secondary Index (GSI) on that table, that is, you currently have fewer than five GSIs)
Define a new GSI on your table. The GSI's partition key will be x and its sort key will be timestamp. Once you have that GSI defined, you can do a query on that index with ScanIndexForward set to false to get the most recent item first. You need to supply the value of x you are interested in; in the following example request it is simply set to 'abc':
{
    "TableName": "<your-table-name>",
    "IndexName": "<your-GSI-name>",
    "KeyConditionExpression": "x = :argx",
    "ExpressionAttributeValues": {
        ":argx": {"S": "abc"}
    },
    "ScanIndexForward": false,
    "Limit": 1
}
This query looks at items with a given x value (as set in the ExpressionAttributeValues field), sorted in descending order by the GSI's sort key (the timestamp field), and picks the first one (Limit is set to 1). As long as you do not need filtering (the FilterExpression field is empty), you will get the result you need by issuing a single Query request.
If you do want to use filtering you will need to do multiple requests and unset the Limit field (i.e., use its default value). See this answer for further details on those subtleties.
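The same request as a minimal boto3 sketch (the table, index, and attribute names are the placeholders used above):

import boto3

client = boto3.client("dynamodb")

response = client.query(
    TableName="your-table-name",  # placeholder
    IndexName="your-GSI-name",    # placeholder
    KeyConditionExpression="x = :argx",
    ExpressionAttributeValues={":argx": {"S": "abc"}},
    ScanIndexForward=False,  # newest first, since timestamp is the GSI sort key
    Limit=1,
)
latest = response["Items"][0] if response["Items"] else None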

DynamoDB schema design to support lookup by ID and timestamp of item?

I need to design a DynamoDB schema to store items whose attributes are:
tid: a UUID string which is a unique identifier of the item
timestamp: an ISO-8601-formatted string representing a date and time related to the item
Other stuff...
and to support the following query patterns (I really want to avoid having to do any scans):
Query by tid
Query by exact timestamp, and by relational ordering expressions (e.g., <=, BETWEEN, etc.) on timestamp. I.e., query all items from a certain date-time range without knowing their tids in advance.
Is this possible to do efficiently in DynamoDB, or is there perhaps another AWS solution that would serve me better?
Given a DynamoDB table as follows:
partition key: tid, type string
sort key: timestamp, type string
You can query on:
tid = 5
tid = 5, timestamp between 2018-12-21T09:00:00Z and 2018-12-21T15:00:00Z
Try it out using the awscli, for example to query all items with tid=5:
aws dynamodb query \
--table-name mytable \
--key-condition-expression "tid = :tid" \
--expression-attribute-values '{":tid":{"S":"5"}}'
To query all items for tid=5 and timestamp between 09:00 and 15:00 on 2018-12-21:
aws dynamodb query \
--table-name mytable \
--key-condition-expression "tid = :tid AND #ts BETWEEN :ts1 AND :ts2" \
--expression-attribute-values '{":tid":{"S":"5"}, ":ts1":{"S":"2018-12-21T09:00:00Z"}, ":ts2":{"S":"2018-12-21T15:00:00Z"}}' \
--expression-attribute-names '{"#ts":"timestamp"}'
Note: because timestamp is a reserved keyword in DynamoDB, you have to escape it using the expression attribute names.
You could also create the timestamp attribute as a number and then store epoch times, if you prefer.
Querying all items with timestamp between 09:00 and 15:00 on 2018-12-21, regardless of tid, cannot be done with the same partition/sort key schema. You would need to add a Global Secondary Index, something like this:
GSI partition key: yyyymmdd, type string
GSI sort key: timestamp, type string
Now you can query for items in a given timestamp range, as long as they fall on the same day (they share the same YYYYMMDD, which might be a reasonable restriction). Or you could use YYYYMM as the partition key, allowing a wider timestamp range per query. At this point you really need to understand the use cases for your queries to decide whether YYYYMMDD (restricting queries to a single day) is right. See How to query DynamoDB by date with no obvious hash key for more on this idea.
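A hedged sketch of a time-range query against such a GSI in boto3; the index name and the YYYYMMDD bucket value are assumptions:

import boto3

client = boto3.client("dynamodb")

response = client.query(
    TableName="mytable",
    IndexName="yyyymmdd-timestamp-index",  # hypothetical GSI name
    KeyConditionExpression="yyyymmdd = :d AND #ts BETWEEN :t1 AND :t2",
    # "timestamp" is a reserved word, so it must be escaped here too.
    ExpressionAttributeNames={"#ts": "timestamp"},
    ExpressionAttributeValues={
        ":d": {"S": "20181221"},
        ":t1": {"S": "2018-12-21T09:00:00Z"},
        ":t2": {"S": "2018-12-21T15:00:00Z"},
    },
)
print(response["Items"])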