Amazon DynamoDB scan is not scanning complete table

I am trying to scan and update all entries with a specific attribute value in my Amazon DynamoDB table. This will be a one-time operation, and the attribute I am querying on is not an index.
If I understood correctly, my only option is to perform a scan of the whole table and update each matching entry as it is encountered.
My table is around 2 GB in size and has over 8.5 million records.
Below is a snippet of my script:
from boto3.dynamodb.conditions import Key

scan_kwargs = {
    'FilterExpression': Key('someKey').eq(sometargetNumber)
}
matched_records = my_table.scan(**scan_kwargs)
print 'Number of records impacted by this operation: ' + str(matched_records['Count'])
user_response = raw_input('Would you like to continue?\n')
if user_response == 'y':
    for item in matched_records['Items']:
        print '\nTarget Record:'
        print(item)
        # sourceResponse is populated earlier in the script (omitted here)
        updated_record = my_table.update_item(
            Key={
                'sessionId': item['attr0']
            },
            UpdateExpression="set att1=:t, att2=:s, att3=:p, att4=:k, att5=:si",
            ExpressionAttributeValues={
                ':t': sourceResponse['Items'][0]['att1'],
                ':s': sourceResponse['Items'][0]['att2'],
                ':p': sourceResponse['Items'][0]['att3'],
                ':k': sourceResponse['Items'][0]['att4'],
                ':si': sourceResponse['Items'][0]['att5']
            },
            ReturnValues="UPDATED_NEW"
        )
        print '\nUpdated Target Record:'
        print(updated_record)
else:
    print('Operation terminated!')
I tested the above script (some values were changed while posting on Stack Overflow) in a TEST environment (<1000 records) and everything works fine, but when I run it in the PRODUCTION environment with 8.5 million records and 2 GB of data, the script finds 0 records.
Do I need to perform the scan differently? Am I missing something, or is this just a limitation of the Scan operation in DynamoDB?

Sounds like your issue is related to how DynamoDB filters data and paginates results. To review what is happening here, consider the order of operations when executing a DynamoDB scan/query operation while filtering. DynamoDB does the following in this order:
Read items from the table
Apply Filter
Return Results
DynamoDB query and scan operations return up to 1MB of data at a time. Anything beyond that will be paginated. You know your results are being paginated if DynamoDB returns a LastEvaluatedKey element in your response.
Filters apply after the 1MB limit. This is the critical step that often catches people off-guard. In your situation, the following is happening:
You execute a scan operation that reads 1MB of data from the table.
DynamoDB applies your filter to that 1MB of data, which (in your case) eliminates every record read in the first step from the response.
DDB returns whatever items remain after filtering (here, none) along with a LastEvaluatedKey element, which indicates there is more data to search.
In other words, your filter isn't applying to the entire table. It's applying to 1MB of the table at a time. In order to get the results you are looking for, you are going to need to execute the scan operation repeatedly until you reach the last "page" of the table.
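Here is a minimal sketch of that pagination loop, assuming the same my_table handle and the placeholder names (someKey, sometargetNumber) from your snippet; I've used Attr rather than Key since the attribute isn't part of a key, though both build the same equality condition here:

from boto3.dynamodb.conditions import Attr

def scan_all_matches(table, attr_name, target_value):
    # Follow LastEvaluatedKey until the whole table has been scanned.
    kwargs = {'FilterExpression': Attr(attr_name).eq(target_value)}
    items = []
    while True:
        response = table.scan(**kwargs)
        items.extend(response.get('Items', []))
        last_key = response.get('LastEvaluatedKey')
        if last_key is None:
            break                                  # last page reached
        kwargs['ExclusiveStartKey'] = last_key     # resume where this page stopped
    return items

matched_records = scan_all_matches(my_table, 'someKey', sometargetNumber)
print('Number of records impacted by this operation: ' + str(len(matched_records)))

Keep in mind that every page still consumes read capacity for everything it scans, filtered out or not, so a full pass over 8.5 million items will take a while and use a fair amount of throughput.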

Related

Dynamo DB queries on Secondary Index

I have a use case where I am fetching data on certain items (unique itemID) multiple times a day (identified by day_BatchTime) and storing them in DynamoDB. My composite primary key consists of itemID & day_BatchTime. I have set itemID as the partition key and day_BatchTime as the sort key.
But I need to report on each day's data on a daily basis. So I tried setting up a global secondary index on feedDate. But the query on this is working a bit slowly in the AWS console. Also, I am getting an error when executing the below query in Lambda using Python. Below are the relevant snippets:
response = table.query(KeyConditionExpression=Key('feedDate').eq('18-03-2022'))
"errorMessage": "An error occurred (ValidationException) when calling the Query operation:
Query condition missed key schema element: itemID"
The table has about 53,000 items, with the global secondary index populated for about 31,000 items, and I am querying for about 6,000 items that are updated in a day. The query execution time appears to be much higher than what one would normally expect.
Below are my global secondary index details.
Name: feedDate-index
Status: Active
Partition key: feedDate (String)
Sort key: -
Read capacity: Range 1 - 10, auto scaling at 70%, current provisioned units: 1
Write capacity: Range 1 - 10, auto scaling at 70%, current provisioned units: 1
Size: 8.9 MB, Item count: 31,737
Please let me know if I am missing something.
As pointed out by @hoangdv in the comments, you forgot to add the index name to the query. By default, Query reads from the base table, so you need to explicitly point it at the global secondary index.
Something like this should do the trick:
response = table.query(
    IndexName="feedDate-index",
    KeyConditionExpression=Key('feedDate').eq('18-03-2022')
)
Concerning your perceived performance issues, those are difficult to address without concrete numbers and data. On a general note, the Query API returns at most 1MB of data per call; anything beyond that requires a follow-up call with the pagination token (the LastEvaluatedKey from the response, passed as ExclusiveStartKey). So your 6,000 items will most likely come back spread across several paginated API calls.
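For completeness, here's a sketch of how that pagination could be handled with the low-level client's built-in paginator (the table name YourTable is a placeholder; the index name and date value are taken from your post):

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('query')

items = []
# The paginator follows LastEvaluatedKey for us, issuing as many calls as needed.
for page in paginator.paginate(
        TableName='YourTable',                       # placeholder table name
        IndexName='feedDate-index',
        KeyConditionExpression='feedDate = :d',
        ExpressionAttributeValues={':d': {'S': '18-03-2022'}}):
    items.extend(page['Items'])

Note that the low-level client returns items in DynamoDB's typed JSON format; with the Table resource you would instead loop on LastEvaluatedKey (ExclusiveStartKey) yourself.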
The source of the query and the complexity of the data may also impact performance. For example, a tiny Lambda function with 128 MB of RAM will take a lot longer to deserialize items than one with more memory (and therefore more CPU). I wrote a blog post about this topic a while ago if you're curious (disclaimer: written by me, relevant to the topic).

AWS DynamoDB - To use a GSI or Scan if I just wish to query the table by Date

I feel like I'm thinking myself in circles here. Maybe you all can help :)
Say I have this simple table design in DynamoDB:
Id | Employee | Created | SomeExtraMetadataColumns... | LastUpdated
Say my only use case is to find all the rows in this table where LastUpdated < (now - 2 hours).
Assume that 99% of the data in the table will not meet this criterion. Assume there is some job running every 15 mins that is updating the LastUpdated column.
Assume there are, say, 100,000 rows and the table grows by maybe 1,000 rows a day (no need for large write capacity).
Assume a single entity will be performing this 'read' use case (no need for large read capacity).
Options I can think of:
Do a scan.
Pro: can leverage parallel scans to scale in the future.
Con: wastes a lot of money reading rows that do not match the filter criteria.
Add a new column called 'Constant' that would always have the value of 'Foo' and make a GSI with the Partition Key of 'Constant' and a Sort Key of LastUpdated. Then execute a query on this index for Constant = 'Foo' and LastUpdated < (now - 2hours).
Pro: Only queries the rows matching the filter. No wasted money.
Con: In theory this would be plagued by the 'hot partition' problem if writes scale up. But I am unsure how much of a problem it will be in practice, as AWS has described this problem as largely a thing of the past.
Honestly, I'm leaning toward the latter option. But I'm curious what the community's thoughts are on this. Perhaps I am missing something.
Based on the assumption that the last_updated field is the only field you need to query against, I would do something like this:
PK: EMPLOYEE::{emp_id}
SK: LastUpdated
Attributes: Employee, ..., Created
PK: EMPLOYEE::UPDATE
SK: LastUpdated::{emp_id}
Attributes: Employee, ..., Created
By denormalising your data here, you create an additional 'update' row per employee, which can be queried with PK = EMPLOYEE::UPDATE and SK between 'datetime' and 'datetime'. This assumes you store the datetime as something like 2020-10-01T00:00:00Z.
You can either insert this additional row yourself, or consider utilising DynamoDB Streams to stream update events to Lambda and add the row from there. You can set a TTL on the 'update' row, which will expire it somewhere between 0 and 48 hours after the TTL you set, keeping the table clean. It doesn't need to be removed instantly because you're querying based on the PK and SK anyway.
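A rough sketch of that query with boto3 (the PK/SK attribute names and the EMPLOYEE::UPDATE prefix follow the layout above; the table name EmployeeTable and the ISO-8601 timestamp format are assumptions):

from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('EmployeeTable')   # placeholder table name

now = datetime.now(timezone.utc)
start = (now - timedelta(hours=2)).strftime('%Y-%m-%dT%H:%M:%SZ')
end = now.strftime('%Y-%m-%dT%H:%M:%SZ')

# SK values look like '<LastUpdated>::<emp_id>', so they sort chronologically.
response = table.query(
    KeyConditionExpression=Key('PK').eq('EMPLOYEE::UPDATE') &
                           Key('SK').between(start, end)
)
recently_updated = response['Items']

Pagination via LastEvaluatedKey is omitted here for brevity.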
A scan is an absolute no-no on a table that size so I would definitely recommend against that. If it increases by 1,000 per day like you say then before long your scan would be unmanageable and would not scale. Even at 100,000 rows a scan is very bad.
You could also utilise DynamoDB Streams to stream your data out to data stores which are suitable for analytics, which is what I assume you're trying to achieve here. For example, you could stream the data to Redshift, RDS, etc. Those require a few extra steps and could benefit from Kinesis depending on the scale of updates, but it's something else to consider.
Ultimately there are quite a lot of options here. I'd start by investigating the denormalisation and then investigate other options. If you're trying to do analytics in DynamoDB I would advise against it.
PS: I nearly always call my PK and SK attributes PK and SK and have them as strings, so I can easily add different types of data or denormalisations to the same table.
Definitely stay away from scan...
I'd look at a GSI with
PK: YYYY-MM-DD-HH
SK: MM-SS.mmmmmm
Now to get the records updated in the last two hours, you need only make three queries.
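A sketch of what those three queries could look like (the GSI name updated-hour-index and the UpdatedHour attribute are assumptions, as is storing LastUpdated as an ISO-8601 string; the idea is one query per hour bucket that overlaps the two-hour window):

from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('EmployeeTable')   # placeholder table name
now = datetime.now(timezone.utc)
cutoff = (now - timedelta(hours=2)).strftime('%Y-%m-%dT%H:%M:%SZ')

items = []
# A two-hour window spans at most three distinct hour buckets.
for hours_back in range(3):
    bucket = (now - timedelta(hours=hours_back)).strftime('%Y-%m-%d-%H')
    response = table.query(
        IndexName='updated-hour-index',                       # hypothetical GSI name
        KeyConditionExpression=Key('UpdatedHour').eq(bucket)  # assumed GSI partition key attribute
    )
    # Trim records from the oldest bucket that fall outside the window.
    items.extend(i for i in response['Items'] if i['LastUpdated'] >= cutoff)

Pagination is again left out for brevity; each bucket's query may need to follow LastEvaluatedKey if an hour of updates exceeds 1MB.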

DynamoDB update one column of all items

We have a huge DynamoDB table (~4 billion items). One of the columns is some kind of category (string), and we would like to either map this column to a new category_id (integer) column or convert the existing one from string to int. Is there a way to do this efficiently without creating a new table and populating it from the beginning? In other words, is there a way to update the existing table in place?
Is there a way to do this efficiently
Not in DynamoDB, that use case is not what it's designed for...
Also note, unless you're talking about the hash or sort key (of the table or of an existing index), DDB doesn't have columns.
You'd run Scan() (in a loop since it only returns 1MB of data)...
Then update each item one at a time. (Note: batching the writes would only save network overhead; under the hood each item is still updated individually.)
If the attribute in question is used as a key in the table or an existing index...then a new table is your only option. Here's a good article with a strategy for migrating a production table.
1. Create a new table (let us call this NewTable), with the desired key structure, LSIs, GSIs.
2. Enable DynamoDB Streams on the original table.
3. Associate a Lambda to the Stream, which pushes the record into NewTable. (This Lambda should trim off the migration flag from Step 5.)
4. [Optional] Create a GSI on the original table to speed up scanning items. Ensure this GSI only has the attributes: Primary Key and Migrated (see Step 5).
5. Scan the GSI created in the previous step (or the entire table) and use the following filter:
   FilterExpression = "attribute_not_exists(Migrated)"
   Update each item in the table with a migrate flag (i.e. "Migrated": { "S": "0" }), which sends it to the DynamoDB Stream (using the UpdateItem API, to ensure no data loss occurs).
   NOTE: You may want to increase write capacity units on the table during the updates.
6. The Lambda will pick up all items, trim off the Migrated flag and push them into NewTable (see the sketch after these steps).
7. Once all items have been migrated, repoint the code to the new table.
8. Remove the original table, and the Lambda function once happy all is good.
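A minimal sketch of what the Lambda in Steps 3 and 6 might look like, assuming the stream is configured with NEW_IMAGE (or NEW_AND_OLD_IMAGES) and that NewTable is the target table's name:

import boto3

dynamodb = boto3.client('dynamodb')

def handler(event, context):
    for record in event['Records']:
        # Only INSERT/MODIFY events carry a new image to copy across.
        if record['eventName'] == 'REMOVE':
            continue
        item = record['dynamodb']['NewImage']      # already in DynamoDB's typed JSON format
        item.pop('Migrated', None)                 # trim off the migration flag (Step 3)
        dynamodb.put_item(TableName='NewTable', Item=item)

If the new table has a different key schema, you would also remap or add the new key attributes here before the put_item call.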

Dynamo DB Query and Scan Behavior Question

I thought of this scenario when querying/scanning a DynamoDB table.
What if I want to get a single item from a table that has 20k items, and the item I'm looking for is around the 19,000th row? I'm using Scan with a Limit of 1000, for example. Does each of those calls consume throughput even though most of them return no items? For instance,
I have a User table:
type UserTable {
    userId: ID!
    username: String,
    password: String
}
Then my query:
var params = {
    TableName: "UserTable",
    FilterExpression: "username = :username",
    ExpressionAttributeValues: {
        ":username": username
    },
    Limit: 1000
};
How to effectively handle this?
According to the doc:

A Scan operation always scans the entire table or secondary index. It then filters out values to provide the result you want, essentially adding the extra step of removing data from the result set.

Performance: If possible, you should avoid using a Scan operation on a large table or index with a filter that removes many results. Also, as a table or index grows, the Scan operation slows.

Read units: The Scan operation examines every item for the requested values and can use up the provisioned throughput for a large table or index in a single operation. For faster response times, design your tables and indexes so that your applications can use Query instead of Scan.
For better performance and less read unit consumption, I advise you to create a GSI and use it with Query.
A Scan operation will look at the entire table and visit all records to find out which of them match your filter criteria. So it will consume enough throughput to read all the visited records, not just the ones returned. A Scan operation is also very slow, especially if the table is large.
To your second question: you can create a secondary index on the table with username as the hash key. Then you can convert the Scan operation into a Query. That way it will only consume enough throughput to fetch the matching record.
Read about Secondary Indices Here
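As a sketch of that approach in Python/boto3 (your snippet is JavaScript, but the idea is the same; the index name username-index is an assumption):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('UserTable')

# Query the GSI instead of scanning: only matching items are read and billed.
response = table.query(
    IndexName='username-index',                        # hypothetical GSI with username as hash key
    KeyConditionExpression=Key('username').eq(username)
)
users = response['Items']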

How to update all records in DynamoDB?

I am new to NoSQL / DynamoDB.
I have a list of ~10,000 container-item records, which is updated every 6 hours:
[
    { containerId: '1a3z5', items: ['B2a3, Z324, D339, M413'] },
    { containerId: '42as1', items: ['YY23, K132'] },
    ...
]
(primary key = containerId)
Is it viable to just delete the table, and recreate with new values?
Or should I loop through every item of the new list, and conditionally update/write/delete the current DynamoDB records (using batchwrite)?
For this scenario, a batch update is the better approach. You have 2 cases:
If you need to update only certain records, then a batch update is more efficient. You can scan the whole table, iterate through the records, and update only the ones that need it.
If you need to update all the records every 6 hours, a batch update will still be more efficient, because dropping and recreating the table also means recreating the indexes, which is not a very fast process. And after you recreate the table you still have to do the inserts, and in the meantime you have to keep all the records in another database or in memory.
One scenario where deleting the whole table is a good approach is when you need to delete all the data from a table with thousands or more records; then it's much faster to recreate the table than to delete all the records through the API.
And one more suggestion: have you considered alternatives? Your problem does not look like a great use case for DynamoDB. For example, MongoDB and Cassandra support update-by-query out of the box.
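If you go the batch-write route, boto3's batch_writer handles the 25-item chunking and retries for you. A minimal sketch, assuming a table named containers and that a full overwrite of each record is acceptable (put_item replaces the whole item):

import boto3

table = boto3.resource('dynamodb').Table('containers')   # placeholder table name

new_records = [
    {'containerId': '1a3z5', 'items': ['B2a3', 'Z324', 'D339', 'M413']},   # example data
    {'containerId': '42as1', 'items': ['YY23', 'K132']},
    # ... ~10,000 records refreshed every 6 hours
]

# batch_writer groups the puts into BatchWriteItem calls (25 items per call)
# and automatically retries any unprocessed items.
with table.batch_writer() as batch:
    for record in new_records:
        batch.put_item(Item=record)

Note that this replaces each item wholesale; if you need to merge with attributes already stored on the item, you'd fall back to per-item update_item calls as discussed in the next answer.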
If the update touches some but not all existing items, and if a partial update of 'items' is possible, then you have no choice but to do a per-record operation. And this would be true even with a more capable database.
You can perhaps speed it up by retrieving only the existing containerIds first, so that based on that set you know which records to update versus insert. Alternatively, you can do a batch retrieve by id using the ids from the set of updates: whichever ones do not return a result are the ones you have to insert, and the ones that do are the ones to update.
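A rough sketch of that id-splitting step with boto3 (batch_get_item reads up to 100 keys per call; the table name containers is a placeholder, and retrying UnprocessedKeys is left out for brevity):

import boto3

dynamodb = boto3.resource('dynamodb')

def split_updates_and_inserts(new_records, table_name='containers'):
    # Return (to_update, to_insert) by checking which containerIds already exist.
    ids = [r['containerId'] for r in new_records]
    existing_ids = set()
    for i in range(0, len(ids), 100):                        # batch_get_item allows 100 keys per call
        chunk = ids[i:i + 100]
        response = dynamodb.batch_get_item(
            RequestItems={
                table_name: {
                    'Keys': [{'containerId': cid} for cid in chunk],
                    'ProjectionExpression': 'containerId'    # we only need the key back
                }
            }
        )
        existing_ids.update(item['containerId'] for item in response['Responses'][table_name])
    to_update = [r for r in new_records if r['containerId'] in existing_ids]
    to_insert = [r for r in new_records if r['containerId'] not in existing_ids]
    return to_update, to_insert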