What is the right way to query with different filters on DynamoDB?

I save my order data in a DynamoDB table. The partition key is orderId and the sort key is timestamp. Each order has many other attributes such as category, userName, price, items, and status. I am going to build a filter service that lets clients query orders based on these attributes, and I'd also like to add a limit on the query for pagination. But I've run into some limitations in DynamoDB.
In order to support querying different fields, I have two options:
Create a GSI for each attribute. This is expensive, but it makes queries on each individual attribute fast. However, it doesn't support combining multiple attributes in one filter.
Attach a filter expression to a Scan to express the attribute conditions. Scan is not performant in the first place, and the filter expression is applied after the Limit, which means the response is likely to contain fewer items than the limit the client requested.
So what is a good way to achieve this in DynamoDB?

There is unfortunately no magic way to solve your problems. There is no DynamoDB feature that you missed. Indeed, as you said, making each of the attributes available for efficient queries requires a GSI, which will cost you additional money - but that's reasonable. Indeed, as you said, there is no efficient way to search for an intersection of conditions on two different attributes. And indeed, the Limit feature doesn't quite do what you want, so you'll need to emulate your page size in the client code (asking for more pages until your desired amount is received), potentially with unacceptably high latency.
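To illustrate the client-side emulation of a page size, here is a minimal sketch using boto3 (Python). The table name "Orders" and the status attribute are assumptions based on the question; the loop simply keeps scanning until enough matching items have been collected or the table is exhausted:

import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical table name from the question; adjust to your schema.
table = boto3.resource("dynamodb").Table("Orders")

def filtered_page(status, page_size, start_key=None):
    # FilterExpression is applied after Limit, so one request may return
    # fewer than page_size items; keep paging until we have enough.
    items, last_key = [], start_key
    while len(items) < page_size:
        kwargs = {
            "FilterExpression": Attr("status").eq(status),
            "Limit": page_size,  # items examined per request, not items returned
        }
        if last_key:
            kwargs["ExclusiveStartKey"] = last_key
        resp = table.scan(**kwargs)
        items.extend(resp.get("Items", []))
        last_key = resp.get("LastEvaluatedKey")
        if not last_key:
            break
    return items[:page_size], last_key

Returning last_key lets the caller request the next page from where this one stopped. Note that this sketch simply discards any extra matches beyond page_size; a production version would carry them over into the next page.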
It sounds like what you really need is a search engine. These have exactly the features that you asked for. You'll still be paying for these features (indexing of individual columns still takes up CPU and disk space, intersection of multiple attribute searches still requires significant work at query time), but search engines are designed for exactly these operations, and do them more efficiently and with lower latency (which is important for interactive searches, which are the bread-and-butter of search engines).

You can add the limit for pagination using the Limit parameter in the query. But can you please be more specific about your access patterns? Are your clients going to query all the orders, or only the orders belonging to them?

Related

Difference between RangeKeyCondition and FilterKeyCondition in aws DynamoDb

I am new to AWS. While reading the docs and examples, I came to know that the sort key is not only used to sort the data within a partition but also to enhance the search criteria on a DynamoDB table. But the same thing can be done with a filter condition. So what is the difference?
Also, according to the example given, we can use the sort/range key in withKeyConditionExpression("CreateDate = :v_date and begins_with(IssueId, :v_issue)"),
but when I tried it, it gave me an exception:
com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Query key condition not supported
Thanks
To limit the Items returned, rather than returning all Items with a particular HASH key, there are two different ways we can handle this:
The ideal way is to build the element we want to query into the RANGE key. This allows us to use Key Expressions to query our data, allowing DynamoDB to quickly find the Items that satisfy our Query.
A second way to handle this is with filtering based on non-key attributes. This is less efficient than Key Expressions but can still be helpful in the right situations. Filter expressions are used to apply server-side filters on Item attributes before they are returned to the client making the call, but the filtering is applied after the DynamoDB Query is completed. If you retrieve 100KB of data in the Query step but filter it down to 1KB of data, you will still consume the Read Capacity Units for 100KB of data.
The moral is: filtering and projection expressions aren't a magic bullet - they won't make it easy to quickly query your data in additional ways. However, they can save network transfer time by limiting the number and size of items transferred back to your network. They can also simplify application complexity by pre-filtering your results rather than requiring application-side filtering.
From dynamodbguide
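As a concrete illustration of the difference, here is a boto3 (Python) sketch against a hypothetical Issues table whose partition key is CreateDate and sort key is IssueId; these names are taken from the question's expression, not from a real schema. (The "Query key condition not supported" exception in the question usually means the key condition references attributes that are not actually part of the table's key schema.)

import boto3
from boto3.dynamodb.conditions import Key, Attr

# Hypothetical table: partition key CreateDate, sort key IssueId.
table = boto3.resource("dynamodb").Table("Issues")

# Key Expression on the sort key: DynamoDB reads only the matching items,
# so you are billed only for what satisfies the condition.
resp = table.query(
    KeyConditionExpression=Key("CreateDate").eq("2024-01-15")
                           & Key("IssueId").begins_with("A-")
)

# Filter expression on a non-key attribute: every item under the partition
# key is read (and billed), then non-matching items are dropped server-side
# before the response is returned.
resp = table.query(
    KeyConditionExpression=Key("CreateDate").eq("2024-01-15"),
    FilterExpression=Attr("Status").eq("Assigned"),
)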

DynamoDB - Partition grouping or sharding?

So, looking through the DynamoDB docs, they'll often recommend that you "group" together items that are related in the same partition, so as to better distribute your partition usage.
Take the following example where we have a user that has contacts and invoices inside its partition:
So, if I need all of user_001's invoices I will simply query (pseudo):
QUERY WHERE PartitionKey = "user_001" AND SortKey.begins_with("invoice_")
But I recently noticed there's quite an issue when you use the method above.
You see, DynamoDB will search inside the whole user_001 partition for the invoices, and will consume read capacity based on all the items it examines, whether they were invoices or not.
This can end up being very inefficient if you have a partition that is too big: let's say I had 10,000 contacts and 2 invoices, it could end up being very costly to get those 2 invoices.
I'm assuming this based on the quote by the docs :
DynamoDB calculates the number of read capacity units consumed based on item size, not on the amount of data that is returned to an application
The solution:
Wouldn't this be a better approach?
1) It shards the data better so I don't need to use starts_with
2) It allows me to use a time-based uuid as the sort key and enable more complex ordering/pagination
3) I will consume much less capacity on queries since it won't have to go through items I don't need
What's the question?
Well, what I said above is just theories and assumptions; the documentation doesn't make it clear how it really works behind the scenes, and it even recommends picture 1 to be used.
But I'm really thinking picture 2 is the best here, especially when you consider that DynamoDB now smartly distributes capacity throughout your partitions (and not evenly like it used to).
So, are my points for thinking picture 2 being much better than 1 valid?
You have assumed incorrectly—the documentation you have quoted applies to filter expressions.
If you have a condition that applies to your sort key, that should be part of the query expression, not a filter expression.
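For reference, here is roughly what the question's pseudo-query looks like as a real key-condition query (a boto3 sketch; the table name and the PartitionKey/SortKey attribute names come from the pseudo-code, not an actual schema). Because begins_with is part of the KeyConditionExpression, DynamoDB seeks straight to the matching sort-key range, so the 10,000 contact items in the same partition are never read and never billed:

import boto3
from boto3.dynamodb.conditions import Key

# Attribute names follow the pseudo-query in the question; table name is hypothetical.
table = boto3.resource("dynamodb").Table("UserData")

resp = table.query(
    KeyConditionExpression=Key("PartitionKey").eq("user_001")
                           & Key("SortKey").begins_with("invoice_")
)
invoices = resp["Items"]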

Better method for querying DynamoDB table randomly?

I've included links to other questions and answers along with our approaches, which seem to be the most optimal options on the web right now.
Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short, using the category as a hash key, and a UUID as the sort key. Generate a random UUID, query Dynamo using greater than or less than, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around).
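For reference, a minimal sketch of Approach 1 in boto3 (Python); the table name "Content", the hash key "category", and the UUID sort key "id" are assumptions for illustration:

import uuid
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table "Content": hash key "category", sort key "id" (a UUID string).
table = boto3.resource("dynamodb").Table("Content")

def random_item(category):
    pivot = str(uuid.uuid4())
    resp = table.query(
        KeyConditionExpression=Key("category").eq(category) & Key("id").gte(pivot),
        Limit=1,
    )
    if not resp["Items"]:
        # The random pivot was larger than every stored UUID; retry in the other
        # direction (this is the "first query can fail" issue noted below).
        resp = table.query(
            KeyConditionExpression=Key("category").eq(category) & Key("id").lt(pivot),
            ScanIndexForward=False,
            Limit=1,
        )
    return resp["Items"][0] if resp["Items"] else None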
Issues with this approach:
The first query can fail if the random UUID is greater than (or less than) all of the UUIDs stored in that category
Querying on any specific category will cause throttling at scale (Small number of partitions)
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog
Choosing the Right DynamoDB Partition Key
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on category/categories, but the cons they offer are really deterring us from using them. We're leaning more towards approach #1 using suffixes to solve the hot partitioning issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? Specifically looking for solutions capable of scaling well (No scan), without requiring extra resources be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a lambda (billed for time used).
Thanks!
Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases where DynamoDB does not fit, and it requires rewriting your use case to fit the DB (bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.
For anyone wondering why we chose MySQL, it was largely due to the Node.js library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazon's Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!
This was too long for a comment, and I guess it's pretty much a full fledged answer now.
Approach 2
I've found that my typical time to get a single item from dynamodb to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys-only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the item with the largest sequence number within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, overwrite the item you are deleting, and then delete the now duplicated item from its position at the highest sequence number.
To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
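Here is a rough sketch of that read path in boto3 (Python). The table name "Content", its hash key "id" in the horror-000001 style, the GSI name "category-index", and the zero-padded sequence format are all assumptions for illustration:

import random
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical names: table "Content" with hash key "id" (e.g. "horror-000001"),
# plus a keys-only GSI "category-index" with hash key "category" and range key "id".
# Zero-padded sequence numbers keep lexical order equal to numeric order.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Content")

def random_item(category):
    # One descending query tells us the highest sequence number in use.
    resp = table.query(
        IndexName="category-index",
        KeyConditionExpression=Key("category").eq(category),
        ScanIndexForward=False,
        Limit=1,
    )
    if not resp["Items"]:
        return None
    highest = int(resp["Items"][0]["id"].rsplit("-", 1)[1])
    # Pick any sequence number in [1, highest] and fetch it with a single GetItem.
    pick = random.randint(1, highest)
    return table.get_item(Key={"id": f"{category}-{pick:06d}"}).get("Item")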
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your dynamodb table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)

How do I optimize my DynamoDB table secondary global index so that records are evenly distributed while still keeping all records sortable?

Related to this question, I'm looking for a more specific answer. In an effort to keep this non-subjective, here is a full thought process for creating an activities table, with a stuck point that can be finished with a quick example answer.
In an effort to better understand DynamoDB, I'm creating a personal website that contains an activity feed from a DynamoDB table. The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part).
Different types of activities will include blog posts, projects, twitter post references, LinkedIn post references, etc. Using the activity type as a partition key would not be wise as my activity is highly weighted, mostly on the twitter side, hardly ever creating blog posts.
A unique activity id seems to be the best option for evenly distributing activities across DynamoDB partitions. However, this completely removes the ability to sort activities, as queries require the partition key to be known first. This is where a secondary global index (SGI) will be helpful: a sort key is not required on the primary partition key, but can instead be paired with a partition key in the SGI.
This is part where I'm stuck. What do I base the SGI partition key on? At the moment I'm thinking of a single value "activity" for all activities with a sort key of "date", but that is a single partition for all entries. Will a single SGI partition key value limit performance in this project?
Note that this is a small scale project. However, I'm thinking about large scale projects while building this one, attempting to create the best DynamoDB table possible in regards to optimized partition distribution, while still keeping it flexible for sorting all table records.
Consider a GSI (Global Secondary Index) the same as the main table's indexes while designing your schema: GSIs also get read/write provisioning limits and are subject to hot-partition throttling, which back-pressures the main table. In other words, if your GSI gets throttled, your main table will start throttling requests.
Will a single SGI partition key value limit performance in this project?
A single partition for the complete table is definitely a misuse of DDB's scaling capability.
The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part).
You can sort across partitions using a GSI, but you will again need a partition key for your GSI, and if that partition key is not distributed enough you run into the problems I mentioned above.
DDB is powerful for put/get operations if modeled right and for fairly simple queries with some filters. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
For your specific need it's not directly possible to get a scalable solution from DDB, but we still have a few options:
Option 1:
We can model the data such that it is fairly distributed for writes, at the cost of extra work when reading it back; this pattern is also known as Randomizing Across Multiple Partition Key Values. Since you don't need to access a specific item at a given time, this will work for us.
The idea is to create a fixed set (say 1 to 100) and randomly pick a number from it to append to the creation date (not the timestamp), with the creation timestamp as the sort key.
This will distribute your load across multiple random partitions, but it increases read complexity, as you will need to query all the partitions and merge the results to get the final sorted view for that date.
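A minimal sketch of this write-sharding pattern in boto3 (Python); the table name "Activities", the attribute names, and the shard count of 10 are assumptions for illustration:

import random
from datetime import datetime, timezone
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table "Activities": hash key "shard_key" (date plus random suffix),
# range key "created_at" (ISO timestamp so lexical order matches time order).
NUM_SHARDS = 10
table = boto3.resource("dynamodb").Table("Activities")

def put_activity(item):
    now = datetime.now(timezone.utc)
    shard = random.randint(1, NUM_SHARDS)
    item["shard_key"] = f"{now.date().isoformat()}#{shard}"  # e.g. "2024-01-15#7"
    item["created_at"] = now.isoformat()
    table.put_item(Item=item)

def activities_for_date(date_str):
    # The read side pays the price: query every shard and merge the results.
    items = []
    for shard in range(1, NUM_SHARDS + 1):
        resp = table.query(
            KeyConditionExpression=Key("shard_key").eq(f"{date_str}#{shard}")
        )
        items.extend(resp["Items"])
    return sorted(items, key=lambda i: i["created_at"], reverse=True)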
Option 2:
Use multiple tables for hot and cold data, since this is time-series data. For more information, read:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
Option 3:
Scan? Not a good choice if we are talking about scalability and growing data, but for a fairly small set of data it surely helps, so I'm mentioning it.
These are just examples, not necessarily a good fit for your use case.
So here is a thought-process exercise for you: write down all your use cases and access patterns. Figure out their importance, which are fine with eventual consistency and which are not, and see whether DDB is a good fit for them in the first place; don't be tempted to use DDB and then struggle with access-pattern scalability.
Also read https://stackoverflow.com/a/38790120/962545 for more questions you should ask yourself before committing to a specific access pattern you want from DDB.
Don't forget to read best practices: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Indexing notifications table in DynamoDB

I am going to implement a notification system, and I am trying to figure out a good way to store notifications within a database. I have a web application that uses a PostgreSQL database, but a relational database does not seem ideal for this use case; I want to support various types of notifications, each including different data, though a subset of the data is common for all types of notifications. Therefore I was thinking that a NoSQL database is probably better than trying to normalize a schema in a relational database, as this would be quite tricky.
My application is hosted in Amazon Web Services (AWS), and I have been looking a bit at DynamoDB for storing the notifications. This is because it is managed, so I do not have to deal with the operations of it. Ideally, I'd like to have used MongoDB, but I'd really prefer not having to deal with the operations of the database myself. I have been trying to come up with a way to do what I want in DynamoDB, but I have been struggling, and therefore I have a few questions.
Suppose that I want to store the following data for each notification:
An ID
User ID of the receiver of the notification
Notification type
Timestamp
Whether or not it has been read/seen
Meta data about the notification/event (no querying necessary for this)
Now, I would like to be able to query for the most recent X notifications for a given user. Also, in another query, I'd like to fetch the number of unread notifications for a particular user. I am trying to figure out a way that I can index my table to be able to do this efficiently.
I can rule out simply having a hash primary key, as I would not be doing lookups by simply a hash key. I don't know if a "hash and range primary key" would help me here, as I don't know which attribute to put as the range key. Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key? Then perhaps a secondary index could help me to sort by the timestamp, if this is even possible.
I also looked at global secondary indexes, but the problem with these is that when querying the index, DynamoDB can only return attributes that are projected into the index - and since I would want all attributes to be returned, I would effectively have to duplicate all of my data, which seems rather ridiculous.
How can I index my notifications table to support my use case? Is it even possible, or do you have any other recommendations?
Motivation Note: When using a Cloud Storage like DynamoDB we have to be aware of the Storage Model because that will directly impact your performance, scalability, and financial costs. It is different than working with a local database because you pay not only for the data that you store but also for the operations that you perform against the data. Deleting a record is a WRITE operation, for example, so if you don't have an efficient plan for clean-up (and your case, being Time Series Data, especially needs one), you will pay the price. Your Data Model will not show problems when dealing with a small data volume but can definitely ruin your plans when you need to scale. That being said, decisions like creating (or not) an index, defining proper attributes for your keys, creating table segmentation, etc. will make the entire difference down the road. Choosing DynamoDB (or, more generically speaking, a Key-Value store), like any other architectural decision, comes with a trade-off: you need to clearly understand certain concepts about the Storage Model to be able to use the tool efficiently, and choosing the right keys is indeed important but only the tip of the iceberg.
For example, if you overlook the fact that you are dealing with Time Series Data, no matter what primary keys or index you define, your provisioned throughput will not be optimized because it is spread throughout your entire table (and its partitions) and NOT ONLY THE DATA THAT IS FREQUENTLY ACCESSED, meaning that unused data is directly impacting your throughput just because it is part of the same table. This leads to cases where the ProvisionedThroughputExceededException is thrown "unexpectedly" when you know for sure that your provisioned throughput should be enough for your demand; however, the TABLE PARTITION that is being unevenly accessed has reached its limits (more details here).
The post below has more details, but I wanted to give you some motivation to read through it and understand that although you can certainly find an easier solution for now, it might mean starting from scratch in the near future when you hit a wall (the "wall" might come as high financial costs, limitations on performance and scalability, or a combination of all of these).
Q: Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key?
A: DynamoDB is a Key-Value storage meaning that the most efficient queries use the entire Key (Hash or Hash-Range). Using the Scan operation to actually perform a query just because you don't have your Key is definitely a sign of deficiency in your Data Model in regards to your requirements. There are a few things to consider and many options to avoid this problem (more details below).
Now before moving on, I would suggest you reading this quick post to clearly understand the difference between Hash Key and Hash+Range Key:
DynamoDB: When to use what PK type?
Your case is a typical Time Series Data scenario where your records become obsolete as time goes by. There are two main factors you need to be careful about:
Make sure your tables have even access patterns
If you put all your notifications in a single table and the most recent ones are accessed more frequently, your provisioned throughput will not be used efficiently.
You should group the most accessed items in a single table so the provisioned throughput can be properly adjusted for the required access. Additionally, make sure you properly define a Hash Key that will allow even distribution of your data across multiple partitions.
The obsolete data is deleted in the most efficient way (effort-, performance-, and cost-wise)
The documentation suggests segmenting the data in different tables so you can delete or backup the entire table once the records become obsolete (see more details below).
Here is the section from the documentation that explains best practices related to Time Series Data:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.
Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with hash and range type primary key with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the applications might show uneven access pattern across all the items in the table where the latest customer data is more relevant and your application might access the latest items more frequently and as time passes these items are less accessed, eventually the older items are rarely accessed. If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables to store these items. For example, you could create tables to store monthly or weekly data. For the table storing data from the latest month or week, where data access rate is high, request higher throughput and for tables storing older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
For example, you could have your tables segmented by month:
Notifications_April, Notifications_May, etc
Q: I would like to be able to query for the most recent X notifications for a given user.
A: I would suggest using the Query operation and querying using only the Hash Key (UserId) having the Range Key to sort the notifications by the Timestamp (Date and Time).
Hash Key: UserId
Range Key: Timestamp
Note: A better solution would be for the Hash Key to contain not only the UserId but also some other concatenated information that you can calculate before querying, to make sure your Hash Key gives you even access patterns to your data. For example, you can start to have hot partitions if notifications for specific users are accessed more than others... having additional information in the Hash Key would mitigate this risk.
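A minimal sketch of that query in boto3 (Python), using the monthly table name suggested above; the ISO-8601 Timestamp format is an assumption so that lexical order matches chronological order:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical monthly table: hash key "UserId", range key "Timestamp" (ISO-8601 string).
table = boto3.resource("dynamodb").Table("Notifications_April")

def latest_notifications(user_id, count=20):
    resp = table.query(
        KeyConditionExpression=Key("UserId").eq(user_id),
        ScanIndexForward=False,  # newest first
        Limit=count,
    )
    return resp["Items"]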
Q: I'd like to fetch the number of unread notifications for a particular user.
A: Create a Global Secondary Index as a Sparse Index having the UserId as the Hash Key and Unread as the Range Key.
Example:
Index Name: Notifications_April_Unread
Hash Key: UserId
Range Key: Unread
When you query this index by Hash Key (UserId) you would automatically have all unread notifications with no unnecessary scans through notifications which are not relevant to this case. Keep in mind that the original Primary Key from the table is automatically projected into the index, so in case you need to get more information about the notification you can always resort to those attributes to perform a GetItem or BatchGetItem on the original table.
Note: You can explore the idea of using different attributes other than the 'Unread' flag, the important thing is to keep in mind that a Sparse Index can help you on this Use Case (more details below).
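A sketch of both operations in boto3 (Python), using the table and index names from the example above. Select="COUNT" avoids transferring item data back, and removing the Unread attribute is what drops the item out of the sparse index:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Notifications_April")  # hypothetical monthly table

def unread_count(user_id):
    # Query the sparse GSI; Select="COUNT" returns only the number of matches.
    resp = table.query(
        IndexName="Notifications_April_Unread",
        KeyConditionExpression=Key("UserId").eq(user_id),
        Select="COUNT",
    )
    return resp["Count"]

def mark_read(user_id, timestamp):
    # Removing the Unread attribute drops the item out of the sparse index.
    table.update_item(
        Key={"UserId": user_id, "Timestamp": timestamp},
        UpdateExpression="REMOVE #u",
        ExpressionAttributeNames={"#u": "Unread"},
    )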
Detailed Explanation:
I would have a sparse index to make sure that you can query a reduced dataset to do the count. In your case you can have an attribute "unread" to flag if the notification was read or not, and use that attribute to create the Sparse Index. When the user reads the notification you simply remove that attribute from the notification so it doesn't show up in the index anymore. Here are some guidelines from the documentation that clearly apply to your scenario:
Take Advantage of Sparse Indexes
For any item in a table, DynamoDB will only write a corresponding index entry if the index range key attribute value is present in the item. If the range key attribute does not appear in every table item, the index is said to be sparse. [...]
To track open orders, you can create an index on CustomerId (hash) and IsOpen (range). Only those orders in the table with IsOpen defined will appear in the index. Your application can then quickly and efficiently find the orders that are still open by querying the index. If you had thousands of orders, for example, but only a small number that are open, the application can query the index and return the OrderId of each open order. Your application will perform significantly fewer reads than it would take to scan the entire CustomerOrders table. [...]
Instead of writing an arbitrary value into the IsOpen attribute, you can use a different attribute that will result in a useful sort order in the index. To do this, you can create an OrderOpenDate attribute and set it to the date on which the order was placed (and still delete the attribute once the order is fulfilled), and create the OpenOrders index with the schema CustomerId (hash) and OrderOpenDate (range). This way when you query your index, the items will be returned in a more useful sort order. [...]
Such a query can be very efficient, because the number of items in the index will be significantly fewer than the number of items in the table. In addition, the fewer table attributes you project into the index, the fewer read capacity units you will consume from the index.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForGSI.html#GuidelinesForGSI.SparseIndexes
Find below some references to the operations that you will need to programmatically create and delete tables:
Create Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_CreateTable.html
Delete Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteTable.html
I'm an active user of DynamoDB and here is what I would do... Firstly, I'm assuming that you need to access notifications individually (e.g. to mark them as read/seen), in addition to getting the latest notifications by user_id.
Table design:
NotificationsTable
id - Hash key
user_id
timestamp
...
UserNotificationsIndex (Global Secondary Index)
user_id - Hash key
timestamp - Range key
id
When you query the UserNotificationsIndex, you set the user_id of the user whose notifications you want and ScanIndexForward to false, and DynamoDB will return the notification ids for that user in reverse chronological order. You can optionally set a limit on how many results you want returned, or get a max of 1 MB.
With regards to projecting attributes, you'll either have to project the attributes you need into the index, or you can simply project the id and then write "hydrate" functionality in your code that does a look up on each ID and returns the specific fields that you need.
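A sketch of that flow in boto3 (Python), using the table and index names from the design above; the hydrate step assumes the index projects only the id:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("NotificationsTable")  # names from the design above

def latest_notification_ids(user_id, limit=20):
    resp = table.query(
        IndexName="UserNotificationsIndex",
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,  # reverse chronological order
        Limit=limit,
    )
    return [item["id"] for item in resp["Items"]]

def hydrate(ids):
    # Fetch the full items for the ids returned by the index query.
    resp = dynamodb.batch_get_item(
        RequestItems={"NotificationsTable": {"Keys": [{"id": i} for i in ids]}}
    )
    return resp["Responses"]["NotificationsTable"]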
If you really don't like that, here is an alternate solution for you... Set your id as your timestamp. For example, I would use the # of milliseconds since a custom epoch (e.g. Jan 1, 2015). Here is an alternate table design:
NotificationsTable
user_id - Hash key
id/timestamp - Range key
Now you can query the NotificationsTable directly, setting the user_id appropriately and setting ScanIndexForward to false to sort the Range key in descending order. Of course, this assumes that you won't have a collision where a user gets 2 notifications in the same millisecond. This should be unlikely, but I don't know the scale of your system.
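A sketch of this alternate design in boto3 (Python); the custom-epoch value (Jan 1, 2015 UTC in milliseconds) and the attribute names are assumptions for illustration:

import time
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical custom epoch: Jan 1, 2015 00:00:00 UTC, in milliseconds.
CUSTOM_EPOCH_MS = 1420070400000
table = boto3.resource("dynamodb").Table("NotificationsTable")

def new_notification(user_id, payload):
    # Milliseconds since the custom epoch serve as both the id and the sort key.
    notification_id = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    table.put_item(Item={"user_id": user_id, "id": notification_id, **payload})
    return notification_id

def latest(user_id, limit=20):
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,  # newest first
        Limit=limit,
    )
    return resp["Items"]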