I thought of this scenario when querying/scanning a DynamoDB table.
What if I want to get a single item from a table that holds 20k items, and the item I'm looking for is at the 19,000th row? Say I'm using Scan with a Limit of 1000. Does each call consume throughput even when it returns no items, as the first 18 pages here would? For instance,
I have a User table:
type UserTable {
  userId: ID!
  username: String
  password: String
}
then my query:
var params = {
  TableName: "UserTable",
  FilterExpression: "username = :username",
  ExpressionAttributeValues: {
    ":username": username
  },
  Limit: 1000 // max items evaluated per Scan call, not max matches returned
};
How do I handle this effectively?
According to the docs:
A Scan operation always scans the entire table or secondary index. It
then filters out values to provide the result you want, essentially
adding the extra step of removing data from the result set.
Performance
If possible, you should avoid using a Scan operation on a large table or index with a filter that removes many results. Also, as a table or index grows, the Scan operation slows.
Read units
The Scan operation examines every item for the requested values and can use up the provisioned throughput for a large table or index in a single operation. For faster response times, design your tables and indexes so that your applications can use Query instead of Scan.
For better performance and lower read-unit consumption, I advise you to create a GSI and use it with Query.
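For illustration, here's a minimal sketch of that pattern in Python with boto3 (the index name username-index is hypothetical, and creating the GSI itself is a separate step):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UserTable")

# Query the GSI directly instead of scanning the whole table:
# only the matching items are read, so only their size is billed.
response = table.query(
    IndexName="username-index",  # hypothetical GSI with username as hash key
    KeyConditionExpression=Key("username").eq("alice"),
)
items = response["Items"]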
A Scan operation looks at the entire table and visits all records to find out which of them match your filter criteria, so it will consume enough throughput to read all of the visited records. A Scan is also very slow, especially if the table is large.
To your second question: you can create a secondary index on the table with username as the hash key. Then you can convert the Scan into a Query, which will only consume enough throughput to fetch the one matching record.
Read about secondary indexes here.
Related
I use DynamoDB to store user data. Each user has many fields, like age, gender, first/last name, address, etc. I need to support a query API that returns only the first, last, and middle names, without the other fields.
In order to provide better performance, I have two candidate solutions:
1. Create a GSI which only includes those query fields. It will make each row very small.
2. Query the table with a projection parameter that includes only those query fields.
The item size is 1 KB, with 20 attributes. 1 MB is the maximum amount of data returned from one query, so I should receive about 1024 items when querying the main index. If I use field projection to reduce the number of fields, will it give me more items in the response?
Given that DynamoDB returns at most 1 MB of data per response, which solution is better for me to use?
What you are trying to achieve is called "Sparse indexes".
It's hard to say without knowing the table's traffic pattern and historical amount of data. Another consideration is the amount of RCUs (read capacity units) used for the operation.
FilterExpression is applied after a Query finishes, but before the results are returned.
Link to Documentation
With that in mind, the amount of RCUs used by the FilterExpression solution will grow with the number of fields and the amount of data each item has. Your costs will increase over time, and you will need to worry about the item size and the number of fields it has.
A review of how RCUs work:
DynamoDB read requests can be either strongly consistent, eventually consistent, or transactional.
A strongly consistent read request of an item up to 4 KB requires one read request unit.
An eventually consistent read request of an item up to 4 KB requires one-half read request unit.
A transactional read request of an item up to 4 KB requires two read request units.
Link to documentation
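To make the arithmetic concrete, here is a small worked sketch (my own illustration, not from the docs). One caveat: Query and Scan sum the sizes of all items processed and then round up to the next 4 KB, rather than rounding each item individually.

import math

def get_item_rcus(item_size_kb, consistency="eventual"):
    # Per the rules above: round the item up to the next 4 KB step.
    factor = {"eventual": 0.5, "strong": 1.0, "transactional": 2.0}[consistency]
    return math.ceil(item_size_kb / 4) * factor

def query_rcus(total_kb_read, consistency="eventual"):
    # Query/Scan sum the sizes of all items read, then round up to 4 KB.
    factor = {"eventual": 0.5, "strong": 1.0}[consistency]
    return math.ceil(total_kb_read / 4) * factor

print(get_item_rcus(1, "strong"))    # 1.0 -- a 1 KB item still costs a full 4 KB unit
print(query_rcus(1024, "eventual"))  # 128.0 -- a full 1 MB page, eventually consistent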
You can use a GSI to get separate throughput and control the RCU capacity that is used. The amount of data transferred will be predictable, and RCU utilization will be based on the index entries only (first, last, and middle name).
You will need to update your application to use the new index and work with eventually consistent reads, since GSIs don't support strongly consistent reads.
Global secondary indexes support eventually consistent reads, each of which consume one half of a read capacity unit. This means that a single global secondary index query can retrieve up to 2 × 4 KB = 8 KB per read capacity unit.
For global secondary index queries, DynamoDB calculates the provisioned read activity in the same way as it does for queries against tables. The only difference is that the calculation is based on the sizes of the index entries, rather than the size of the item in the base table.
Link to documentation
Returning to your question, "which solution is better for me to use?": if you need strongly consistent reads, use the base table with a FilterExpression. Otherwise, use a GSI.
A good read on this is the article When to use (and when not to use) DynamoDB Filter Expressions.
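If you go the GSI route, a sketch of adding an index that projects only the name attributes could look like the following (table, index, and attribute names are hypothetical; omit ProvisionedThroughput if the table is on-demand):

import boto3

client = boto3.client("dynamodb")

# Create a GSI whose entries carry only the name attributes, so each
# index entry stays small and queries against it consume fewer RCUs.
client.update_table(
    TableName="Users",  # hypothetical table name
    AttributeDefinitions=[
        {"AttributeName": "userId", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "names-only-index",  # hypothetical index name
            "KeySchema": [{"AttributeName": "userId", "KeyType": "HASH"}],
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["firstName", "middleName", "lastName"],
            },
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 5,
                "WriteCapacityUnits": 5,
            },
        }
    }],
)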
First of all, it's important to note that DynamoDB's 1 MB limit is not a blocker; it's there for performance reasons.
Your use case seems to aim at unnecessarily reducing your payload to below the 1 MB limit, when you should instead just introduce pagination.
DynamoDB paginates the results from Query operations. With pagination, the Query results are divided into "pages" of data that are 1 MB in size (or less). An application can process the first page of results, then the second page, and so on.
The LastEvaluatedKey from a Query response should be used as the ExclusiveStartKey for the next Query request. If there is not a LastEvaluatedKey element in a Query response, then you have retrieved the final page of results. If LastEvaluatedKey is not empty, it does not necessarily mean that there is more data in the result set. The only way to know when you have reached the end of the result set is when LastEvaluatedKey is empty.
Ref
GSI or ProjectionExpression
This ultimately depends on what you need. For example, if you simply want certain attributes and the base table's keys suit your access patterns, then I would 100% use a ProjectionExpression and paginate the results until I have all the data, as in the sketch after this paragraph.
You should only create a GSI if the keys of the base table do not suit your access pattern needs: a GSI increases your costs, since you end up storing more data and consuming extra throughput that your use case doesn't need.
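A minimal sketch of that ProjectionExpression-plus-pagination approach in Python with boto3 (table, key, and attribute names are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Users")  # hypothetical table name

items = []
kwargs = {
    "KeyConditionExpression": Key("userId").eq("user-123"),  # hypothetical key
    "ProjectionExpression": "firstName, middleName, lastName",
}
while True:
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    # LastEvaluatedKey is present only while there are more pages.
    if "LastEvaluatedKey" not in resp:
        break
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]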
Theoretical table with billions of entries.
Partition key is a unique uuid representing a given deviceId. There will be around 10k unique uuids.
Sort Key is a dateString for when the data was collected.
Each item has some data fields. There are dozens of fields such that making a GSI for each wouldn't be reasonable. For our example, let's say we are looking for the "dataOfInterest" field.
I'd like to search the DB for "all items where dataOfInterest = 'foobar'", and ideally do it within a date range. As far as I know, a Scan operation is the only option. With billions of entries, that's not going to be a fast process (though I understand I could split it out to run multiple operations at a time; it's still going to eat RCUs like crazy).
Of note, I only care about a given uuid for each search, however. In other words, what I REALLY care about is "all items within a given partition where dataOfInterest = 'foobar'". And further, it'd be great to use the sort key to get "all items within a given partition where dataOfInterest = 'foobar' that are between Jan 1 and Feb 28".
The Scan operation allows you to limit the results with a filter expression, such that I could get the results of just a single partition... but it still reads the entire table, and the filtering is only applied before the data is returned to you. https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html
Is there an AWS API that does a scan-like operation that reads only a given partition? Are there other ways to achieve this (perhaps re-architecting the DB?)
As @jarmod says, you can use a Query and specify the UUID as the PK. You can then either put the timestamp in the SK and filter on the dataOfInterest value (unindexed), or, for more efficiency and to keep everything indexed, construct a composite SK of the form dataOfInterest#timestamp and then do a range query on the SK from foobar#time1 to foobar#time2. That makes this query perfectly index-optimized.
Of course, this makes purely timestamp-based queries less simple. So you either do multiple queries for those or, if you want both query patterns to be efficient, set up this composite SK in a GSI and use that to resolve this query.
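A sketch of the composite-SK range query with boto3 (table and attribute names are hypothetical, and the SK is assumed to be stored as dataOfInterest#timestamp):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DeviceData")  # hypothetical name

# One indexed query: a single partition, one dataOfInterest value,
# bounded to the Jan 1 - Feb 28 date range.
resp = table.query(
    KeyConditionExpression=(
        Key("deviceId").eq("some-device-uuid")
        & Key("sk").between("foobar#2023-01-01", "foobar#2023-02-28")
    ),
)
items = resp["Items"]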
I need to be able to run some range-based queries on my DynamoDB table, such as int_attribute > 5 or starts_with(string_attribute, "foo"). These can all be answered by creating a global or local secondary index and then submitting a Query against those indexes. However, running a Query requires that you also provide a single value of the partition key to restrict the query set. Neither of these queries has a strict equality condition, so I am considering giving all the items in my table the same partition key and distinguishing them only by the sort key. My dataset is well within the 10 GB partition size limit.
Are there any catastrophic issues that might occur if I do this?
Yes, you can create a GSI where every item goes under the same partition key. The thing to be aware of is that you'll generally be putting all of those writes into the same physical partition, which has a maximum update rate of 1,000 WCUs.
If your update rate is below that, proceed. If your update rate is above that, you'll want to follow a pattern of sharding the GSI partition key value so it spreads across more partitions.
Say you require 10,000 WCUs for the GSI. You can assign each item's GSI PK a random value-{x} where x is 0 to 9. Then, yes, at query time you do 10 queries and put the results back together yourself, as in the sketch below. This approach can scale as large as you need.
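A rough sketch of the write-side sharding and the fan-out read in Python with boto3 (table, index, and attribute names are hypothetical):

import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Items")  # hypothetical table name
NUM_SHARDS = 10

# On write: spread items across 10 GSI partition key values.
table.put_item(Item={
    "pk": "item-123",                                   # base table key
    "gsi_pk": f"value-{random.randrange(NUM_SHARDS)}",  # e.g. "value-7"
})

# On read: query every shard and merge the results yourself.
results = []
for shard in range(NUM_SHARDS):
    resp = table.query(
        IndexName="sharded-index",  # hypothetical GSI name
        KeyConditionExpression=Key("gsi_pk").eq(f"value-{shard}"),
    )
    results.extend(resp["Items"])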
I am trying to scan and update all entries with a specific attribute value in my Amazon DynamoDB table. This will be a one-time operation, and the attribute I am filtering on is not an index key.
If I understood correctly, my only option is to perform a Scan of the whole DynamoDB table and, whenever such an entry is encountered, update it.
My table size is around 2 GB, and the table has over 8.5 million records.
Below is snippet of my script:
# Snippet only: my_table, sometargetNumber, and sourceResponse are
# defined earlier in the full script.
from boto3.dynamodb.conditions import Key

scan_kwargs = {
    'FilterExpression': Key('someKey').eq(sometargetNumber)
}
matched_records = my_table.scan(**scan_kwargs)

print 'Number of records impacted by this operation: ' + str(matched_records['Count'])
user_response = raw_input('Would you like to continue?\n')
if user_response == 'y':
    for item in matched_records['Items']:
        print '\nTarget Record:'
        print(item)
        updated_record = my_table.update_item(
            Key={
                'sessionId': item['attr0']
            },
            UpdateExpression="set att1=:t, att2=:s, att3=:p, att4=:k, att5=:si",
            ExpressionAttributeValues={
                ':t': sourceResponse['Items'][0]['att1'],
                ':s': sourceResponse['Items'][0]['att2'],
                ':p': sourceResponse['Items'][0]['att3'],
                ':k': sourceResponse['Items'][0]['att4'],
                ':si': sourceResponse['Items'][0]['att5']
            },
            ReturnValues="UPDATED_NEW"
        )
        print '\nUpdated Target Record:'
        print(updated_record)
else:
    print('Operation terminated!')
I tested the above script (some values were changed when posting on Stack Overflow) in a TEST environment (<1000 records) and everything works fine, but when I run it in the PRODUCTION environment with 8.5 million records and 2 GB of data, the script scans 0 records.
Do I need to perform the scan differently? Am I missing something, or is this just a limitation of the Scan operation in DynamoDB?
It sounds like your issue is related to how DynamoDB filters data and paginates results. To review what is happening here, consider the order of operations when executing a DynamoDB Scan/Query operation with a filter. DynamoDB does the following, in this order:
1. Read items from the table
2. Apply the filter
3. Return the results
DynamoDB Query and Scan operations return up to 1 MB of data at a time. Anything beyond that will be paginated. You know your results are being paginated when DynamoDB returns a LastEvaluatedKey element in its response.
Filters are applied after the 1 MB limit. This is the critical step that often catches people off guard. In your situation, the following is happening:
1. You execute a Scan operation that reads 1 MB of data from the table.
2. The filter is applied to that 1 MB response, which happens to eliminate all of the records read in the first step.
3. DDB returns the remaining (zero) items along with a LastEvaluatedKey element, which indicates there is more data to search.
In other words, your filter isn't applied to the entire table; it is applied to 1 MB of the table at a time. In order to get the results you are looking for, you will need to execute the Scan operation repeatedly, feeding each LastEvaluatedKey back in as the next ExclusiveStartKey, until you reach the last "page" of the table.
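A sketch of that loop, reusing my_table and sometargetNumber from the question's script (Attr, rather than Key, is the idiomatic boto3 condition class for a non-key attribute):

from boto3.dynamodb.conditions import Attr

scan_kwargs = {
    'FilterExpression': Attr('someKey').eq(sometargetNumber)
}
matched_items = []
while True:
    page = my_table.scan(**scan_kwargs)
    matched_items.extend(page['Items'])
    # Keep scanning until DynamoDB stops returning a LastEvaluatedKey.
    if 'LastEvaluatedKey' not in page:
        break
    scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']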
I want to know whether I have to use a DynamoDB Scan operation to get a list of all the hash key values in a table, or whether there is a less expensive approach. I tried a Query operation, but it was unsuccessful in my case, since I would have to specify a single hash key value to use it. I just want a list of all the hash key values in the table.
Yes, you need to use the Scan method to access every item in the table. You can reduce the size of the data returned to you by setting the attributes_to_get parameter to only what you need(*), e.g. just the hash key value. Also note that Scan operations are eventually consistent, so if this database is actively growing, your result set may not include the most recently added items.
(*) This will reduce the amount of bandwidth consumed and make the result less resource-intensive to process on the application side, but it will not reduce the amount of throughput that you are charged for: Scan charges are based on the size of each entire item read, not just the attributes returned.
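For example, with boto3 you would express this with a ProjectionExpression (the modern replacement for attributes_to_get) and paginate through the whole table; the table and key names here are hypothetical:

import boto3

table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical table name

hash_keys = set()
kwargs = {"ProjectionExpression": "userId"}  # assuming userId is the hash key
while True:
    page = table.scan(**kwargs)
    hash_keys.update(item["userId"] for item in page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]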
Unfortunately, to get a list of hash key values you have to perform a Scan operation. What is your use case? Typically the application should keep track of hash key values, since the workload needs to be evenly distributed across them. As a result, a Scan for this purpose should not happen frequently.
Edit: note that filtering the results using attributes_to_get or a projection expression will make the results cleaner, but it will not reduce the amount of throughput you are charged: Scan charges are based on the size of the entire item, not just the attributes that get returned.