How to solve "hot" hash key issue (space skewed data) in DynamoDB? - amazon-web-services

For example, I am using DynamoDB to store product purchase records. The hash key is the product ID and the range key is the purchase time.
Some popular products can have a lot of purchase records (space-skewed data), so read/write requests for those "hot" partitions can get throttled while other partitions are not using their full throughput.
How can I solve this problem and still be able to get the latest purchase records? Thanks!

You can use a caching solution to achieve this.
You can follow the guidelines on caching popular items when designing a table:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.CachePopularItems
My solution is to use ElastiCache (Redis): keep a list of the latest purchases for each product and trim it to the most recent 100 entries, for example:
LPUSH product:100 2016-08-13:purchaseId
LTRIM product:100 0 99
This trims the list to the latest 100 items.
I hope this helps.
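The LPUSH/LTRIM pattern above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: `record_purchase` and `latest_purchases` are hypothetical helper names, and `InMemoryListStore` is a made-up stand-in so the example runs without a server. A redis-py client exposes `lpush`, `ltrim`, and `lrange` methods with the same argument order, so it could be passed in instead.

```python
class InMemoryListStore:
    """Hypothetical stand-in that mimics the three Redis list calls used below.

    Only non-negative indexes are supported, which is all this sketch needs.
    """

    def __init__(self):
        self._lists = {}

    def lpush(self, key, *values):
        lst = self._lists.setdefault(key, [])
        for value in values:
            lst.insert(0, value)  # newest entry goes to the head of the list
        return len(lst)

    def ltrim(self, key, start, end):
        lst = self._lists.get(key, [])
        self._lists[key] = lst[start:end + 1]  # Redis end index is inclusive
        return True

    def lrange(self, key, start, end):
        return self._lists.get(key, [])[start:end + 1]


def record_purchase(client, product_id, purchase_ref, keep=100):
    """Push the newest purchase and keep only the `keep` most recent ones."""
    key = f"product:{product_id}"
    client.lpush(key, purchase_ref)
    client.ltrim(key, 0, keep - 1)


def latest_purchases(client, product_id, count=10):
    """Return up to `count` purchases, newest first."""
    return client.lrange(f"product:{product_id}", 0, count - 1)


store = InMemoryListStore()
for i in range(150):
    record_purchase(store, 100, f"2016-08-13:purchase-{i}")

recent = latest_purchases(store, 100, count=3)
cached_total = len(store.lrange("product:100", 0, 999))
```

Reads for the latest purchases of a popular product are then served from the cached list rather than from the hot DynamoDB partition.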

Related

Access Patterns Dynamo DB possible with a NoSQL Database?

I am relatively new to NoSQL database structures.
To my mind, the following access patterns involve relations and analytical queries, in which case a SQL data structure would be a better fit than a NoSQL one.
I am wondering whether the following access patterns are even possible with DynamoDB, or whether one has to first pull the data out of DynamoDB into, e.g., a Lambda function to process it.
Get all customers of this month who increased their spending by more than 25% compared to last month
Get the annual spending of a customer (only monthly spending is entered into DynamoDB)
Get customers that have spending over 0 in a specific timeframe
Get all orders in a specific timeframe where the customer who placed the order has the following attributes (female, 25 years old, 170 cm tall)
Get all active customers, active suppliers of software, and active suppliers of raw materials for a given timeframe

How to query data in AWS AppSync in a specific range then sort its result by another key?

I created a table named BlogAuthor in AWS DynamoDB with the following structure:
authorId | orgId | age | name
Later I need to run a query like this: get all authors from the organization with id = orgId123 whose age is between 30 and 50, then sort them by name in alphabetical order.
I'm not sure it's possible to perform such a query in DynamoDB (later I'll apply it in AppSync), so my first solution was to create an index (GSI) with partitionKey=orgId and sortKey=age (the final name is orgId-age-index).
But then, when I query DynamoDB with partitionKey orgId=orgId123, sortKey age=[30;50], and no filter, I get a list of authors. However, there is no way to sort that list by name in the same query.
I then tried another solution: creating a new index with partitionKey=orgId and sortKey=name, then querying (not scanning) DynamoDB with partitionKey orgId=orgId123, an empty sortKey value (because we only want to sort by name rather than fetch a specific name), and a filter on age in the range [30;50]. This solution seems to work; however, I noticed that the filter is applied to the result list. For example, the result list has 100 items, but after the age filter is applied, maybe 70 items remain, or none. I would always like it to return 100 items.
Could you please tell me whether there is anything wrong with my approaches? Or is it possible to make such a query in DynamoDB?
Another (small) question concerns connecting that table to an AppSync API: if such a query is not possible in DynamoDB, is it not possible in AppSync either?
You are not going to be able to do everything you want in a single DynamoDB query.
Option 1:
You can do what you want as long as you are ok with sorting objects on the client. This would work for organizations with a relatively small number of people.
Pros:
Allows you to efficiently query users in a particular organization within an age range.
Cons:
Results are not sorted by name on the server.
Option 2:
Query an index keyed on orgId and name, and apply a filter on age.
Pros:
Allows you to paginate through users at an organization that are ordered by the name.
Cons:
You cannot efficiently get all users in an organization within an age range. You would effectively be scanning the index and would need multiple round trip calls.
Option 3:
A third option would be to stream data from DynamoDB into Elasticsearch using DynamoDB Streams and AWS Lambda. Once the data is in Elasticsearch, you can run much more advanced queries. You can find more information on the Elasticsearch search APIs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html
Pros:
Much more powerful query engine.
Cons:
More overhead from the DynamoDB stream and the AWS Lambda function.
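Option 1 (range query on the server, sort on the client) might look roughly like the sketch below. It only builds the request parameters in the shape the low-level DynamoDB Query API accepts and sorts a sample result locally, so it runs without AWS access. The helper names `build_age_range_query` and `sort_by_name` are hypothetical, the index name comes from the question, and the sketch assumes `orgId` is stored as a string and `age` as a number.

```python
def build_age_range_query(org_id, min_age, max_age,
                          table_name="BlogAuthor",
                          index_name="orgId-age-index"):
    """Return keyword arguments shaped for the low-level DynamoDB Query API."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # BETWEEN on the index sort key is evaluated by DynamoDB itself,
        # so only authors in the age range are read.
        "KeyConditionExpression": "orgId = :org AND #age BETWEEN :lo AND :hi",
        "ExpressionAttributeNames": {"#age": "age"},
        "ExpressionAttributeValues": {
            ":org": {"S": org_id},
            ":lo": {"N": str(min_age)},  # assumes age is a Number attribute
            ":hi": {"N": str(max_age)},
        },
    }


def sort_by_name(items):
    """Sort low-level items ({"name": {"S": ...}}) by name, ignoring case."""
    return sorted(items, key=lambda item: item["name"]["S"].lower())


params = build_age_range_query("orgId123", 30, 50)
# With boto3 the call would be roughly: boto3.client("dynamodb").query(**params)

# Made-up items in the shape a Query response returns under "Items".
sample_items = [
    {"name": {"S": "Carol"}, "age": {"N": "41"}},
    {"name": {"S": "alice"}, "age": {"N": "35"}},
    {"name": {"S": "Bob"}, "age": {"N": "48"}},
]
sorted_names = [item["name"]["S"] for item in sort_by_name(sample_items)]
```

Because the sort happens after all pages are fetched, this only suits organizations whose in-range result set fits comfortably in memory, which matches the caveat in Option 1.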

DynamoDB scan performance issue

I am having a problem with DynamoDB performance, and I want to clear up a few things I am a little confused about.
When scanning a table books that holds 100 records, with a condition using Attr (e.g. Attr('Author').eq('some-well-known-author-with-many-books-written')), if the author has 20 records in the table, does DynamoDB still scan the other 80 records?
How does pagination work when doing a scan?
What are the consequences of consuming more than your allocated RCUs and WCUs?
Answering your questions in order:
Yes. Scan means an iteration over all records in a table. If Author is your partition key and you need to find all books written by her, you should Query (not Scan), in which case it won't look at other Authors.
Pagination works as expected: if you have n records in your table and you Scan with Limit set to m, DynamoDB reads up to m records per page and returns whichever of them match the filter. Each response includes a LastEvaluatedKey that you pass to the next call until no key is returned.
DynamoDB will throttle your requests if you try to go beyond configured RCUs or WCUs. There'll be no cost impact, if that's what you are worried about.
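The first two answers can be illustrated with a small pagination loop. This is a sketch under stated assumptions: `scan_all` is a hypothetical helper that accepts any callable behaving like a Scan call (for example a boto3 table's `scan` method), and `make_fake_scan` is a made-up in-memory stand-in so the example runs without AWS. It shows that the filter only reduces what is returned, not what is read.

```python
def scan_all(scan_page, **scan_kwargs):
    """Collect matching items by following LastEvaluatedKey across pages."""
    items = []
    scanned = 0
    kwargs = dict(scan_kwargs)
    while True:
        page = scan_page(**kwargs)
        items.extend(page.get("Items", []))
        scanned += page.get("ScannedCount", 0)
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items, scanned
        # Resume the next page where the previous one stopped.
        kwargs["ExclusiveStartKey"] = last_key


def make_fake_scan(records, author):
    """Hypothetical stand-in for a Scan with a filter on the Author attribute."""
    def scan_page(Limit=25, ExclusiveStartKey=None):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["index"] + 1
        chunk = records[start:start + Limit]  # records read for this page
        page = {
            "Items": [r for r in chunk if r["Author"] == author],  # filter after reading
            "ScannedCount": len(chunk),
        }
        if start + Limit < len(records):
            page["LastEvaluatedKey"] = {"index": start + len(chunk) - 1}
        return page
    return scan_page


# 100 made-up books, 20 of them by the author being searched for.
books = [{"Title": f"book-{i}", "Author": "A" if i % 5 == 0 else "B"}
         for i in range(100)]
matches, scanned_count = scan_all(make_fake_scan(books, "A"), Limit=30)
```

Here 20 items come back, but all 100 records are read across four pages, which is why a Query on a suitable key is preferable when one exists.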

Searching items in Amazon DynamoDB

Is it possible to get the items from DynamoDB table by querying against any attribute of the table except the primary key? In my table I have product ID as the Hash key and I have not specified any Range key. I want to add support for filters based on various attributes such as product price, product brand, available units in stock etc. And while filtering I do not want to provide the product ID since in most of the cases I may not know the product ID. Coming from a SQL background I was assuming DynamoDB to also have some sort of 'where' clause to list records that match certain criteria/value of attribute(s). However, so far I haven't had success.
Even after going through the Query and Scan documentation, I couldn't figure out how to use these operations optimally for my needs, or how to perform search/filter in my application without burning through my provisioned throughput capacity.
Any ideas as to how this can be done?
Create a Global Secondary Index on the attributes you want to query on. It has its own read and write capacity units, along with other considerations. If you need to add indices to an existing table, AWS preannounced Online Indexing a couple of months ago, so look out for news on when that is released. If you need more than simple queries against these indices (EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN), you may want to consider a search solution such as Amazon CloudSearch.
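As a rough illustration of what such an index definition looks like, the sketch below builds parameters in the shape the DynamoDB UpdateTable API accepts for creating a Global Secondary Index. Everything named here is an assumption for illustration: the table name `Products`, the index name `brand-price-index`, the attributes `brand` (string) and `price` (number), the capacity figures, and the helper `build_gsi_create_request`. It also assumes a table in provisioned capacity mode.

```python
def build_gsi_create_request(table_name, index_name, hash_attr, range_attr,
                             read_units=5, write_units=5):
    """Return keyword arguments shaped for the DynamoDB UpdateTable API."""
    return {
        "TableName": table_name,
        # Key attributes of the new index must be declared with their types.
        "AttributeDefinitions": [
            {"AttributeName": hash_attr, "AttributeType": "S"},
            {"AttributeName": range_attr, "AttributeType": "N"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": index_name,
                    "KeySchema": [
                        {"AttributeName": hash_attr, "KeyType": "HASH"},
                        {"AttributeName": range_attr, "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                    # The index has its own capacity, separate from the table's.
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": read_units,
                        "WriteCapacityUnits": write_units,
                    },
                }
            }
        ],
    }


request = build_gsi_create_request("Products", "brand-price-index", "brand", "price")
# With boto3 this would be roughly: boto3.client("dynamodb").update_table(**request)
```

With an index like this, a Query for one brand with a BETWEEN condition on price reads only the matching index entries and does not need the product ID.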

Django webapp - tracking financial account information

I need some coding advice as I am worried that I am creating, well, bloated code that is inefficient.
I have a webapp that keeps track of a company's financial data. I have a table called Accounts with a collection of records corresponding to the typical financial accounts such as revenue, cash, accounts payable, accounts receivable, and so on. These records are simply name holders to be pointed at as foreign keys.
I also have a table called Account_Transaction which records all the transactions of money in and out of all the accounts in Accounts. Essentially, the Account_Transaction table does all the heavy lifting while pointing to the various accounts being altered.
For example, when a sale is made, two records are created in the Account_Transaction table. One record to increase the cash balance and a second record to increase the revenue balance.
Trans Record 1:
Acct: Cash
Amt: 50.00
Date: Nov 1, 2011
Trans Record 2:
Acct: Revenue
Amt: 50.00
Date: Nov 1, 2011
So now I have two records, but they each point to a different account. Now if I want to view my cash balance, I have to look at each Account_Transaction record and check if the record deals with Cash. If so, add/subtract the amount of that record and move to the next.
During a typical business day, there may be upwards of 200-300 transactions like the one above. As such, the Account_Transaction table will grow pretty quickly. After a few months, the table could have a few thousand records. Granted this isn't much for a database, however, every time the user wants to know the current balance of, say, accounts receivable, I have to traverse the entire Account_Transaction table to sum up all records that deal with the account name "Accounts Receivable".
I'm not sure I have designed this in the most optimal manner. I had considered creating a distinct table for each account (one for "Cash", another for "Accounts Receivable" another for "Revenue" etc...), but with that approach I was creating 15-20 tables with the exact same parameters, other than their name. This seemed like poor design so I went with this Account_Transaction idea.
Does this seem like an appropriate way to handle this kind of data? Is there a better way to do this that I should really be adopting?
Thanks!
Why do you need to iterate through all the records to figure out the status of the Accounts Receivable account? Am I missing something, or can't you just use .filter in the Django ORM to pick only the records you need?
As your records grow, you could add some date filtering to your reports. In most cases, your accountant will only want numbers for this quarter, month, etc., not entire historic data.
Add an index to that column to optimize selection, and then check out Django's aggregation to Sum up values in your database.
Finally, you could do some conservative caching to speed things up for "quick view" style reports where you just want a total very quickly, but you need to be careful not to serve stale totals, so resetting that cache on any change to the records would be a must.
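The filter, index, and aggregate advice can be tried outside Django with the standard-library sqlite3 module, as in the sketch below. The table and column names (`account_transaction`, `account`, `amount_cents`, `posted_on`) and the helper `balance` are assumptions for illustration, not the poster's actual schema. In Django the same query would be roughly a `.filter(...)` on the account followed by `.aggregate(Sum(...))` on the amount field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_transaction (
        id INTEGER PRIMARY KEY,
        account TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,   -- integer cents avoid float rounding
        posted_on TEXT NOT NULL          -- ISO dates compare correctly as text
    );
    -- Index on the filtered columns so the sum does not walk the whole table.
    CREATE INDEX idx_txn_account_date
        ON account_transaction (account, posted_on);
""")
conn.executemany(
    "INSERT INTO account_transaction (account, amount_cents, posted_on) "
    "VALUES (?, ?, ?)",
    [
        ("Cash", 5000, "2011-11-01"),
        ("Revenue", 5000, "2011-11-01"),
        ("Cash", -1250, "2011-11-02"),
        ("Cash", 2000, "2011-10-15"),
    ],
)


def balance(conn, account, since=None):
    """Let the database sum one account, optionally from a start date."""
    sql = ("SELECT COALESCE(SUM(amount_cents), 0) "
           "FROM account_transaction WHERE account = ?")
    args = [account]
    if since is not None:
        sql += " AND posted_on >= ?"
        args.append(since)
    return conn.execute(sql, args).fetchone()[0]


cash_total = balance(conn, "Cash")
cash_november = balance(conn, "Cash", since="2011-11-01")
```

The application never loops over transactions itself: the database selects the rows for one account through the index and returns a single number.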
Why don't you keep track of the exact available amount in the Accounts table? The Account_Transaction table would then be used only to view transaction history.