DynamoDB scan performance issue - amazon-web-services

I am having a problem with DynamoDB performance and want to clear up something I am a little confused about.
When scanning the table books, which holds 100 records, with a filter condition built with Attr (e.g. Attr('Author').eq('some-well-known-author-with-many-books-written')): if the author has 20 records in the table, does DynamoDB still scan the other 80 records?
How does pagination work when doing a scan?
What are the consequences of consuming more than your allocated RCU and WCU?

Answering your questions in order:
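Yes. A Scan iterates over every record in the table; the filter is applied only after the items have been read, so you consume read capacity for all 100 records even though only 20 are returned. If Author is your partition key and you need to find all books written by her, you should Query (not Scan), in which case it won't look at other authors.
Pagination works page by page: if you have n records in your table and you Scan with Limit set to m, DynamoDB evaluates at most m records per request (before any filter is applied) and returns a LastEvaluatedKey, which you pass as ExclusiveStartKey to fetch the next page. A single request also stops once it has read 1 MB of data, even without a Limit.
DynamoDB will throttle your requests if you try to go beyond the configured RCUs or WCUs: the request fails with a ProvisionedThroughputExceededException, which the AWS SDKs retry automatically with backoff. There'll be no cost impact, if that's what you are worried about.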
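To make the first two points concrete, here is a minimal sketch of a filtered, paginated Scan. It assumes `client` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`) and that the table has a string attribute named Author; the function name and the page size are made up for illustration.

```python
def scan_books_by_author(client, table_name, author, page_limit=25):
    """Scan the whole table page by page, keeping only one author's books.

    `client` is assumed to be a boto3 DynamoDB client. Returns a tuple
    (matching_items, items_examined) so the cost of the Scan is visible.
    """
    items, examined = [], 0
    kwargs = {
        "TableName": table_name,
        "FilterExpression": "#a = :a",
        "ExpressionAttributeNames": {"#a": "Author"},
        "ExpressionAttributeValues": {":a": {"S": author}},
        "Limit": page_limit,  # max items examined per request, before filtering
    }
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        # ScannedCount counts every item read, not just the ones that matched.
        examined += page.get("ScannedCount", 0)
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # no more pages
        kwargs["ExclusiveStartKey"] = last_key  # resume where this page ended
    return items, examined
```

With the 100-record table from the question, this would return the 20 matching items while reporting that all 100 were examined, which is why a Query on the partition key is preferable when the key design allows it.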

Related

Dynamodb one bulk scan vs many single gets

Suppose I have a Lambda function and, in the event parameter, I get about 50 primary IDs that I have to look up in a DynamoDB table. What would be the better way to do it: 50 get requests, each by a different primary ID, or one scan followed by comparing the primary IDs in the scan results with the IDs received as the parameter?
I think 50 get requests would perform better, because if tomorrow I have one million records it would be a waste of time and memory to scan them all and then keep only 50 of them. On the other hand, couldn't making 50 requests to DynamoDB cause performance issues and require more provisioned capacity?
You're right that a Scan operation, assuming you will only need to read 50 records out of a million, is the worst possible solution. It will be very slow, and will cost you a pretty penny because when you scan, you pay Amazon to read all your data - even if you filter most of it out.
Making 50 separate GetItem requests isn't so bad - it's certainly better than a scan. You only pay Amazon for the items actually retrieved - you don't pay more because it's 50 separate requests. Of course, if you don't want huge latency, don't just start these requests one after another - start them all in parallel.
But for this use case, DynamoDB provides an even better operation: BatchGetItem. With this operation you give DynamoDB the list of 50 required keys, in just one HTTP request, and it will fetch all of them (in parallel) and return all the responses to you. A single call accepts up to 100 keys and may hand some of them back in UnprocessedKeys, which you then request again. It seems to me that BatchGetItem is the best fit for your use case.
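A rough sketch of the BatchGetItem approach follows. It assumes `client` is a boto3 DynamoDB client and that the partition key is a string attribute; the attribute name `id`, the function name and the retry settings are hypothetical.

```python
import time


def batch_get_by_ids(client, table_name, ids, key_name="id", max_rounds=5):
    """Fetch items for the given string ids using BatchGetItem.

    `client` is assumed to be a boto3 DynamoDB client; `key_name` is a
    hypothetical string partition key. Ids should be unique, since
    BatchGetItem rejects duplicate keys in one request.
    """
    found = []
    for start in range(0, len(ids), 100):  # at most 100 keys per call
        chunk = ids[start:start + 100]
        request = {table_name: {"Keys": [{key_name: {"S": i}} for i in chunk]}}
        for attempt in range(max_rounds):
            response = client.batch_get_item(RequestItems=request)
            found.extend(response.get("Responses", {}).get(table_name, []))
            # Keys DynamoDB could not serve (e.g. throttling) come back here,
            # in the same shape as RequestItems, so they can be re-sent as is.
            request = response.get("UnprocessedKeys") or {}
            if not request:
                break
            time.sleep(0.05 * 2 ** attempt)  # back off before retrying
        # Keys still unprocessed after max_rounds are dropped in this sketch.
    return found
```

The results are not guaranteed to come back in the order of `ids`, so callers that care about order would need to re-sort by key.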

handling dynamo db read and write units

I am using DynamoDB as the back-end database in my project. I store items of 80 KB or more each (containing nested JSON), and my partition key is a column whose value is unique for each item. Now I want to paginate over this table: my UI will provide start (integer), limit (integer) and type (2 string constants), and my API should retrieve the items from DynamoDB based on these query parameters. I am using the scan method from boto3 (the Python SDK), but the scan reads all the items in my table before applying my filters and causes a provisioned throughput error, and I cannot afford to either increase my table's throughput or enable auto scaling. Is there any way to solve this problem? Please give your suggestions.
Do you have a limit set on your scan call? If not, DynamoDB will return up to 1 MB of data per request by default. You could try using Limit and some kind of sleep or delay in your code, so that you process your table at a slower rate and stay within your provisioned read capacity. If you do this, you'll have to use the LastEvaluatedKey returned to you to page through your table.
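One way to pace a scan like this is sketched below. It assumes `client` is a boto3 DynamoDB client and uses the ReturnConsumedCapacity parameter to learn how much each page cost; the function name, the page size and the capacity budget are illustrative assumptions, not values from the question.

```python
import time


def paced_scan(client, table_name, page_limit=10, rcu_per_second=5.0,
               sleep=time.sleep):
    """Yield scan pages while trying to stay under `rcu_per_second`.

    `client` is assumed to be a boto3 DynamoDB client. `sleep` is injectable
    so the pacing can be checked without actually waiting.
    """
    kwargs = {
        "TableName": table_name,
        "Limit": page_limit,                # keep each page small
        "ReturnConsumedCapacity": "TOTAL",  # ask DynamoDB what the page cost
    }
    while True:
        page = client.scan(**kwargs)
        yield page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return  # the whole table has been read
        kwargs["ExclusiveStartKey"] = last_key
        used = page.get("ConsumedCapacity", {}).get("CapacityUnits", 0.0)
        # Spread the capacity just consumed over enough seconds to fit the budget.
        sleep(used / rcu_per_second)
```

For a UI, the LastEvaluatedKey of the last page shown can be handed back to the client as an opaque cursor, which avoids re-reading earlier pages on every request.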
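Keep in mind that just to read a single one of your 80 KB items, you'll be using 10 read capacity units with an eventually consistent read (80 KB / 4 KB = 20 units, halved for eventual consistency) or 20 with a strongly consistent read, and more if your items are larger.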
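The arithmetic behind that figure can be written down as a small helper. This is only a sketch of the documented rule of one read capacity unit per 4 KB (rounded up), halved for eventually consistent reads; the function name is made up.

```python
import math


def read_capacity_units(item_size_bytes, strongly_consistent=False):
    """Estimate the RCUs consumed by reading one item of the given size."""
    units = math.ceil(item_size_bytes / 4096)  # one unit per 4 KB, rounded up
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return units if strongly_consistent else units / 2


print(read_capacity_units(80 * 1024))                            # 10.0
print(read_capacity_units(80 * 1024, strongly_consistent=True))  # 20
```

At 10 units per item, a table provisioned with 50 RCUs can sustain roughly five such eventually consistent item reads per second.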

How to solve "hot" hash key issue (space skewed data) in DynamoDB?

For example, I am using DynamoDB to store product purchase records. The hash key is product ID and the range key is purchase time.
Some popular products can have a lot of purchase records (space skewed) so that read/write requests can get throttled for "hot" partitions while other partitions are not using full throughput.
How to solve this problem and still be able to get latest purchase records? Thanks!
You can use a cache solution in order to achieve this.
You can follow the guidelines when designing a table to cache the popular items:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.CachePopularItems
My solution for this is to use ElastiCache (Redis): you can keep a list of the latest purchases per product and trim it to the most recent 100 purchases for each product, for example:
LPUSH product:100 2016-08-13:purchaseId
LTRIM product:100 0 99
Because LPUSH adds new purchases at the head of the list, the LTRIM keeps only the 100 most recent items.
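The same two commands could be wrapped in application code roughly as follows. This sketch assumes `redis_client` is a redis-py client (e.g. `redis.Redis(...)`) pointed at the ElastiCache endpoint; the function names and the `product:<id>` key layout mirror the example above but are otherwise hypothetical.

```python
def record_purchase(redis_client, product_id, purchased_at, purchase_id, keep=100):
    """Push a purchase onto the product's list and keep only the newest `keep`."""
    key = f"product:{product_id}"
    redis_client.lpush(key, f"{purchased_at}:{purchase_id}")  # newest at the head
    redis_client.ltrim(key, 0, keep - 1)                      # drop older entries


def latest_purchases(redis_client, product_id, count=10):
    """Return up to `count` of the most recent purchases, newest first."""
    return redis_client.lrange(f"product:{product_id}", 0, count - 1)
```

Reads for the latest purchases of a popular product can then be served from the cache, leaving the hot partition in DynamoDB to handle mostly writes.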
I hope this helps.

Do Global Secondary Index (GSI) in DynamoDB impact tables provision capacity

I have queries for 2 use cases with different throughput needs being directed to one DynamoDB table.
First use case needs read/write only using primary key, but needs at least 1700/sec write and 8000/sec read
The second use case utilizes every GSI, but queries that use a GSI are few and far between: fewer than 10 queries per minute.
So my provisioned capacity for GSI will be far less than what is provisioned for primary key. Does this mean when I do a write on the table, the performance upper bound is what I have provisioned for GSI?
I asked AWS Support the same question; below is their answer:
Your question is worth asking. In the scenario you mention, your read/write requests in the GSI will be throttled, and 10 writes / min will be the effective limit. This will create issues whenever you update your primary table, because the updates get mirrored to the GSI. So you should either provision similar write capacity for the GSI or not keep attributes in the GSI that get updated frequently.
Here is a link to our documentation that will help you:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations
I think so. When you add new items, they need to be added to the GSI as well, so the same write capacity is needed there too.
In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.
There're more details and use-cases here:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.ThroughputConsiderations
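If you decide to raise the index's write capacity to match the table, the UpdateTable request could be built along these lines. This is a sketch only: the helper name is made up, and the table name, index name and capacity numbers in the usage comment are placeholders rather than values taken from the question.

```python
def gsi_throughput_update(table_name, index_name, read_units, write_units):
    """Build UpdateTable arguments that change one GSI's provisioned throughput."""
    return {
        "TableName": table_name,
        "GlobalSecondaryIndexUpdates": [
            {
                "Update": {
                    "IndexName": index_name,
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": read_units,
                        "WriteCapacityUnits": write_units,
                    },
                }
            }
        ],
    }


# Hypothetical usage with a boto3 DynamoDB client:
# client.update_table(**gsi_throughput_update("my-table", "my-index", 5, 1700))
```

Reads on the index can stay low (the queries are rare), while the write units need to cover every table write that touches an indexed or projected attribute.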

Questions on dynamoDB query result

I'm currently thinking about how I should write my queries for DynamoDB. I have a few questions below that I hope someone can advise me on.
Given scenario: I have a million records in a table.
Questions:
When I query, can I fetch 1000 records in batches instead of 1 million records at one go?
Is the time taken to fetch 1000 records similar to 1 million records?
What happens if I hit the 1 MB limit or my table's throughput limit, and how can I then fetch the remaining records?
Thanks in advance!
1) Yes, you can specify a limit for a query (1000 in your case).
2) No, the time is not the same. More records mean more time, because you will need to fetch more pages (most of the time will be spent in network round trips).
3) If you hit the 1 MB limit, DynamoDB will provide a LastEvaluatedKey. You repeat the request and pass the LastEvaluatedKey as ExclusiveStartKey until you have fetched everything (you are basically fetching in a loop).
If you hit the provisioned throughput limits, you either increase the limits or you back off (i.e. you need to regulate your consumption to stay within the limits).
Reference: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
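The loop described in 3), together with backing off on throttling, might look like the sketch below. It assumes `client` is a boto3 DynamoDB client and a string partition key; the function names are hypothetical. The generic `except Exception` that inspects `exc.response` is a shortcut to keep the sketch import-free; real code would normally catch `botocore.exceptions.ClientError`, and boto3 already retries throttled calls a few times on its own.

```python
import time


def call_with_backoff(operation, kwargs, max_retries=5, sleep=time.sleep):
    """Call `operation(**kwargs)`, retrying with exponential backoff on throttling."""
    for attempt in range(max_retries + 1):
        try:
            return operation(**kwargs)
        except Exception as exc:
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            throttled = code == "ProvisionedThroughputExceededException"
            if not throttled or attempt == max_retries:
                raise
            sleep(0.1 * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...


def query_batch(client, table_name, pk_name, pk_value, batch_size=1000,
                start_key=None, max_retries=5):
    """Return (items, next_key): up to `batch_size` items of one partition.

    Pass the returned `next_key` back as `start_key` to get the next batch;
    `next_key` is None once the partition is exhausted.
    """
    kwargs = {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :v",
        "ExpressionAttributeNames": {"#pk": pk_name},
        "ExpressionAttributeValues": {":v": {"S": pk_value}},
    }
    items, next_key = [], start_key
    while len(items) < batch_size:
        kwargs["Limit"] = batch_size - len(items)  # never read more than needed
        if next_key:
            kwargs["ExclusiveStartKey"] = next_key
        page = call_with_backoff(client.query, kwargs, max_retries)
        items.extend(page.get("Items", []))
        next_key = page.get("LastEvaluatedKey")  # set when a page hit Limit or 1 MB
        if not next_key:
            break
    return items, next_key
```

Calling `query_batch` repeatedly with the returned key walks through the partition 1000 records at a time, which is the batching asked about in the first question.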